Descriptive Statistics: Introduction


Student Learning Outcomes 
By the end of this chapter, the student should be able to: 


• Display data graphically and interpret graphs: stemplots, histograms,
and boxplots.

• Recognize, describe, and calculate the measures of location of data:
quartiles and percentiles.

• Recognize, describe, and calculate the measures of the center of data:
mean, median, and mode.

• Recognize, describe, and calculate the measures of the spread of data:
variance, standard deviation, and range.


Introduction 


Once you have collected data, what will you do with it? Data can be 
described and presented in many different formats. For example, suppose 
you are interested in buying a house in a particular area. You may have no 
clue about the house prices, so you might ask your real estate agent to give 
you a sample data set of prices. Looking at all the prices in the sample often 
is overwhelming. A better way might be to look at the median price and the 
variation of prices. The median and variation are just two ways that you 
will learn to describe data. Your agent might also provide you with a graph 
of the data. 


In this chapter, you will study numerical and graphical ways to describe and 
display your data. This area of statistics is called "Descriptive Statistics". 
You will learn to calculate, and even more importantly, to interpret these 
measurements and graphs. 


Descriptive Statistics: Displaying Data 
This module provides a brief introduction into the ways graphs and charts 
can be used to provide visual representations of data. 


A statistical graph is a tool that helps you learn about the shape or
distribution of a sample. The graph can be a more effective way of 
presenting data than a mass of numbers because we can see where data 
clusters and where there are only a few data values. Newspapers and the 
Internet use graphs to show trends and to enable readers to compare facts 
and figures quickly. 


Statisticians often graph data first to get a picture of the data. Then, more 
formal tools may be applied. 


Some of the types of graphs that are used to summarize and organize data 
are the dot plot, the bar chart, the histogram, the stem-and-leaf plot, the 
frequency polygon (a type of broken line graph), pie charts, and the 
boxplot. In this chapter, we will briefly look at stem-and-leaf plots, line 
graphs and bar graphs. Our emphasis will be on histograms and boxplots. 


Descriptive Statistics: Stem and Leaf Graphs (Stemplots) 
This module introduces the use of stem-and-leaf graphs (stemplots), line 
graphs and bar graphs for describing a set of data visually. 


One simple graph, the stem-and-leaf graph or stem plot, comes from the
field of exploratory data analysis. It is a good choice for numerical data sets
that are small. To create the plot, divide each observation of data into a stem
and a leaf. The leaf consists of the final significant digit. For example, 23
has stem 2 and leaf 3. Four hundred thirty-two (432) has stem 43 and leaf 2.
Five thousand four hundred thirty-two (5,432) has stem 543 and leaf 2. The
decimal 9.3 has stem 9 and leaf 3. Write the stems in a vertical line from
smallest to largest. Draw a vertical line to the right of the stems. Then
write the leaves in increasing order next to their corresponding stem.


Example: 

For Susan Dean's spring pre-calculus class, scores for the first exam were 
as follows (smallest to largest): 
33 42 49 49 53 55 55 61 63 67 68 68 69 69 72 73 74 78 80 83 88 88 88 90 92
94 94 94 94 96 100


Stem | Leaf
   3 | 3
   4 | 2 9 9
   5 | 3 5 5
   6 | 1 3 7 8 8 9 9
   7 | 2 3 4 8
   8 | 0 3 8 8 8
   9 | 0 2 4 4 4 4 6
  10 | 0


Stem-and-Leaf Diagram


The stem plot shows that most scores fell in the 60s, 70s, 80s, and 90s.
Eight out of the 31 scores, or approximately 26% of the scores, were in the
90s or 100, a fairly high number of As.
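The construction steps above can be sketched in a few lines of Python. This is a minimal illustration that assumes non-negative whole-number data, so the leaf is simply the last digit; the function name `stemplot` is ours and does not come from the text.

```python
from collections import defaultdict


def stemplot(values):
    """Return the rows of a stem-and-leaf plot for non-negative integers."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # leaf = final digit, stem = the rest
        leaves[stem].append(leaf)
    rows = []
    # List every stem from smallest to largest, even one with no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row_leaves = "".join(str(d) for d in leaves.get(stem, []))
        rows.append(f"{stem:>2} | {row_leaves}")
    return rows


scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]

for row in stemplot(scores):
    print(row)
```

Because the values are sorted before they are split, the leaves on each row come out in increasing order, as the instructions require.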


The stem plot is a quick way to graph and gives an exact picture of the data. 
You want to look for an overall pattern and any outliers. An outlier is an 
observation of data that does not fit the rest of the data. It is sometimes 
called an extreme value. When you graph an outlier, it will appear not to fit 
the pattern of the graph. Some outliers are due to mistakes (for example, 
writing down 50 instead of 500) while others may indicate that something 
unusual is happening. It takes some background information to explain 
outliers. In the example above, there were no outliers. 


Example: 

Create a stem plot using the data: 
1.1 1.5 2.3 2.5 2.7 3.2 3.3 3.3 3.5 3.8 4.0 4.2 4.5 4.5 4.7 4.8 5.5 5.6 6.5 6.7 12.3
The data are the distance (in kilometers) from a home to the nearest 
supermarket. 

Exercise: 


Problem: 


1. Are there any values that might possibly be outliers? 
2. Do the data seem to have any concentration of values? 


Note: The leaves are to the right of the decimal.


Solution: 


The value 12.3 may be an outlier. Values appear to concentrate at 3 
and 4 kilometers. 


Stem | Leaf
   1 | 1 5
   2 | 3 5 7
   3 | 2 3 3 5 8
   4 | 0 2 5 5 7 8
   5 | 5 6
   6 | 5 7
   7 |
   8 |
   9 |
  10 |
  11 |
  12 | 3


Glossary 


Outlier 
An observation that does not fit the rest of the data. 


Descriptive Statistics: Line Graphs and Bar Graphs 
This module introduces the use of stem-and-leaf graphs (stemplots), line 
graphs and bar graphs for describing a set of data visually. 


Another type of graph that is useful for specific data values is a line graph. 
In the particular line graph shown in the example, the x-axis consists of 
data values and the y-axis consists of frequency points. The frequency 
points are connected. 


Example: 

Line Graphs 

In a survey, 40 mothers were asked how many times per week a teenager 
must be reminded to do his/her chores. The results are shown in the table 
and the line graph. 


Number of times teenager is reminded    Frequency
0                                       2
1                                       5
2                                       8
3                                       14
4                                       7
5                                       4


[Figure: line graph with Number of Times Teenager is Reminded on the x-axis
and Frequency on the y-axis.]


Bar graphs are useful for displaying categorical data. Bar graphs consist of 
bars that are separated from each other. The bars can be rectangles or they 
can be rectangular boxes and they can be vertical or horizontal. 


Example: 

Bar Graphs 

The columns in the table below contain the race/ethnicity of U.S. Public 
Schools: High School Class of 2011, percentages for the Advanced 
Placement Examinee Population for that class and percentages for the 
Overall Student Population. The 3-dimensional graph shows the 
Race/Ethnicity of U.S. Public Schools (qualitative data) on the x-axis and 
Advanced Placement Examinee Population percentages on the y-axis. 
(Source: http://www.collegeboard.com and Source: 
http://apreport.collegeboard.org/goals-and-findings/promoting-equity) 


Race/Ethnicity                                   AP Examinee    Overall Student
                                                 Population     Population
1 = Asian, Asian American or Pacific Islander    10.3%          5.7%
2 = Black or African American                    9.0%           14.7%
3 = Hispanic or Latino                           17.0%          17.6%
4 = American Indian or Alaska Native             0.6%           1.1%
5 = White                                        57.1%          59.2%
6 = Not reported/other                           6.0%           1.7%


[Figure: bar graph "Ethnicity/Race vs. Percent of AP Examinees", with
race/ethnicity categories 1 through 6 on the x-axis and the percent of AP
examinees on the y-axis.]


Go to Outcomes of Education Figure 22 for an example of a bar graph that 
shows unemployment rates of persons 25 years and older for 2009. 


Note: This book contains instructions for constructing a histogram and a
box plot for the TI-83+ and TI-84 calculators. You can find additional
instructions for using these calculators on the Texas Instruments (TI)
website.




Descriptive Statistics: Histogram 

This module provides an overview of Descriptive Statistics: Histogram as a 
part of Collaborative Statistics collection (col10522) by Barbara Illowsky 
and Susan Dean. 


For most of the work you do in this book, you will use a histogram to 
display the data. One advantage of a histogram is that it can readily display 
large data sets. A rule of thumb is to use a histogram when the data set 
consists of 100 values or more. 


A histogram consists of contiguous boxes (the boxes touch, unlike in a bar 
graph). It has both a horizontal axis and a vertical axis. The horizontal axis 
is labeled with what the data represents (for instance, distance from your 
home to school). The vertical axis is labeled either Frequency or relative 
frequency. The graph will have the same shape with either label. The 
histogram (like the stemplot) can give you the shape of the data, the center, 
and the spread of the data. (The next section tells you how to calculate the 
center and the spread.) 


The relative frequency is equal to the frequency for an observed value of 
the data divided by the total number of data values in the sample. (In the 
chapter on Sampling and Data, we defined frequency as the number of 
times an answer occurs.) If: 


• f = frequency

• n = total number of data values (or the sum of the individual
frequencies), and

• RF = relative frequency,


then:
Equation:


$RF = \frac{f}{n}$


For example, if 3 students in Mr. Ahab's English class of 40 students
received from 90% to 100%, then

f = 3, n = 40, and $RF = \frac{f}{n} = \frac{3}{40} = 0.075$


Seven and a half percent of the students received 90% to 100%. Ninety
percent to 100% are quantitative measures.
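The relative frequency formula is simple enough to check by hand, but a short Python sketch makes the arithmetic explicit. The function name `relative_frequency` is our own label for the formula RF = f / n.

```python
def relative_frequency(f, n):
    """Return RF = f / n, the share of the n data values counted by f."""
    if n <= 0:
        raise ValueError("n must be a positive count of data values")
    return f / n


# Mr. Ahab's class: 3 of 40 students scored from 90% to 100%.
rf = relative_frequency(3, 40)
print(f"RF = {rf:.3f}, or {rf:.1%} of the class")
```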


To construct a histogram, first decide how many bars or intervals, also 
called classes, represent the data. Many histograms consist of from 5 to 15 
bars or classes for clarity. Choose a starting point for the first interval to be 
less than the smallest data value. 


Example: 

The following data are the heights (in inches to the nearest half inch) of 
100 male semiprofessional soccer players. The heights are continuous data 
since height is measured. 

60 60.5 61 61 61.5 

The smallest data value is 60. We will use that as our starting point.


Note: We will make each bar or class interval 2 units wide. 


The boundaries are: 


• 60
• 60 + 2 = 62
• 62 + 2 = 64
• 64 + 2 = 66
• 66 + 2 = 68
• 68 + 2 = 70
• 70 + 2 = 72
• 72 + 2 = 74
• 74 + 2 = 76


The heights 60 through 61.5 inches are in the interval 60 - 62. The heights
that are 63.5 are in the interval 62 - 64. The heights that are 64 through
64.5 are in the interval 64 - 66. The heights 66 through 67.5 are in the
interval 66 - 68. The heights 68 through 69.5 are in the interval 68 - 70.
The heights 70 through 71 are in the interval 70 - 72. The heights 72
through 73.5 are in the interval 72 - 74. The height 74 is in the interval
74 - 76.
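The boundary arithmetic above can be sketched in Python. The helper names `class_boundaries` and `class_of` are hypothetical. The sketch also assumes that a height equal to a boundary belongs to the interval that starts there, which is how the text places 64 in the interval 64 - 66 and 74 in the interval 74 - 76.

```python
def class_boundaries(start, width, count):
    """Return count + 1 boundaries: start, start + width, and so on."""
    return [start + i * width for i in range(count + 1)]


def class_of(value, boundaries):
    """Return the interval (low, high) with low <= value < high, or None."""
    for low, high in zip(boundaries, boundaries[1:]):
        if low <= value < high:
            return (low, high)
    return None


boundaries = class_boundaries(start=60, width=2, count=8)
print(boundaries)

for height in (61.5, 63.5, 64, 74):
    print(height, "falls in", class_of(height, boundaries))
```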

The following histogram displays the heights on the x-axis and relative 
frequency on the y-axis. 


[Figure: histogram with heights on the x-axis and Relative Frequency on the
y-axis.]


Example: 


The following data are the number of books bought by 50 part-time college 
students at ABC College. The number of books is discrete data since books 
are counted. 

1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4
5 5 5 5 5
6 6

Eleven students buy 1 book. Ten students buy 2 books. Sixteen students 
buy 3 books. Six students buy 4 books. Five students buy 5 books. Two 
students buy 6 books. 

Because the data are integers, subtract 0.5 from 1, the smallest data value 
and add 0.5 to 6, the largest data value. Then the starting point is 0.5 and 
the ending value is 6.5. 

Exercise: 


Problem: 


Next, calculate the width of each bar or class interval. If the data are 
discrete and there are not too many different values, a width that 
places the data values in the middle of the bar or class interval is the 
most convenient. Since the data consist of the numbers 1, 2, 3, 4, 5, 6 
and the starting point is 0.5, a width of one places the 1 in the middle 
of the interval from 0.5 to 1.5, the 2 in the middle of the interval from 
1.5 to 2.5, the 3 in the middle of the interval from 2.5 to 3.5, the 4 in 
the middle of the interval from _______ to _______, the 5 in the
middle of the interval from _______ to _______, and the _______ in
the middle of the interval from _______ to _______.


Solution: 


• 3.5 to 4.5
• 4.5 to 5.5
• 6
• 5.5 to 6.5


Calculate the number of bars as follows: 


Equation:
$\frac{6.5 - 0.5}{\text{bars}} = 1$


where 1 is the width of a bar. Therefore, bars = 6.
The following histogram displays the number of books on the x-axis and
the frequency on the y-axis.

[Figure: histogram with Number of Books on the x-axis and frequency on the
y-axis.]
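As a cross-check on the hand calculation, the following Python sketch rebuilds the six classes for the book data and counts the frequency in each. The data list is reconstructed from the frequencies stated in the example, and the variable names are ours. The text histogram is a rough stand-in for a real plot.

```python
# Number of books bought by 50 students, rebuilt from the stated frequencies.
books = [1] * 11 + [2] * 10 + [3] * 16 + [4] * 6 + [5] * 5 + [6] * 2

start = min(books) - 0.5   # 0.5, half a unit below the smallest value
end = max(books) + 0.5     # 6.5, half a unit above the largest value
width = 1
bars = round((end - start) / width)

classes = []
for i in range(bars):
    low = start + i * width
    high = low + width
    frequency = sum(1 for b in books if low < b < high)
    classes.append((low, high, frequency))

for low, high, frequency in classes:
    print(f"{low} to {high}: {'*' * frequency} ({frequency})")
```

Each whole-number value sits in the middle of its class, so no data value can fall on a boundary.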


Using the TI-83, 83+, 84, 84+ Calculator Instructions 

Go to the Appendix (14:Appendix) in the menu on the left. There are 
calculator instructions for entering data and for creating a customized 
histogram. Create the histogram for Example 2. 


• Press Y=. Press CLEAR to clear out any equations.

• Press STAT 1:EDIT. If L1 has data in it, arrow up into the name L1,
press CLEAR and arrow down. If necessary, do the same for L2.

• Into L1, enter 1, 2, 3, 4, 5, 6

• Into L2, enter 11, 10, 16, 6, 5, 2

• Press WINDOW. Make Xmin = .5, Xmax = 6.5, Xscl = (6.5 - .5)/6,
Ymin = -1, Ymax = 20, Yscl = 1, Xres = 1

• Press 2nd Y=. Start by pressing 4:Plotsoff ENTER.

• Press 2nd Y=. Press 1:Plot1. Press ENTER. Arrow down to TYPE.
Arrow to the 3rd picture (histogram). Press ENTER.

• Arrow down to Xlist: Enter L1 (2nd 1). Arrow down to Freq. Enter L2
(2nd 2).

• Press GRAPH

• Use the TRACE key and the arrow keys to examine the histogram.


Optional Collaborative Exercise 


Count the money (bills and change) in your pocket or purse. Your instructor 
will record the amounts. As a class, construct a histogram displaying the 
data. Discuss how many intervals you think is appropriate. You may want to 
experiment with the number of intervals. Discuss, also, the shape of the 
histogram. 


Record the data, in dollars (for example, 1.25 dollars). 


Construct a histogram. 


Glossary 


Frequency 
The number of times a value of the data occurs. 


Relative Frequency 
The ratio of the number of times a value of the data occurs in the set of 
all outcomes to the number of all outcomes. 


Descriptive Statistics: Box Plot 


Box plots or box-whisker plots give a good graphical image of the 
concentration of the data. They also show how far from most of the data the 
extreme values are. The box plot is constructed from five values: the 
smallest value, the first quartile, the median, the third quartile, and the 
largest value. The median, the first quartile, and the third quartile will be 
discussed here, and then again in the section on measuring data in this 
chapter. We use these values to compare how close other data values are to 
them. 


The median, a number, is a way of measuring the "center" of the data. You 
can think of the median as the "middle value," although it does not actually 
have to be one of the observed values. It is a number that separates ordered 
data into halves. Half the values are the same number or smaller than the 
median and half the values are the same number or larger. For example, 
consider the following data: 


1 11.5 6 7.2 4 8 9 10 6.8 8.3 2 2 10 1
Ordered from smallest to largest:
1 1 2 2 4 6 6.8 7.2 8 8.3 9 10 10 11.5


The median is between the 7th value, 6.8, and the 8th value, 7.2. To find the
median, add the two values together and divide by 2.
Equation:


$\frac{6.8 + 7.2}{2} = 7$


The median is 7. Half of the values are smaller than 7 and half of the values 
are larger than 7. 
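The same steps (order the data, locate the middle, average the two middle values when the count is even) can be written as a short Python function. This is a sketch for illustration. Python's standard library also provides `statistics.median`, which follows the same rule.

```python
def median(values):
    """Return the middle value of the ordered data.

    With an even number of values, average the two middle values.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


data = [1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1]
print(sorted(data))
print("median =", median(data))
```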


Quartiles are numbers that separate the data into quarters. Quartiles may or
may not be part of the data. To find the quartiles, first find the median or
second quartile. The first quartile is the middle value of the lower half of
the data and the third quartile is the middle value of the upper half of the
data. To get the idea, consider the same data set shown above:


1 1 2 2 4 6 6.8 7.2 8 8.3 9 10 10 11.5


The median or second quartile is 7. The lower half of the data is 1, 1, 2, 2, 
4, 6, 6.8. The middle value of the lower half is 2. 


1 1 2 2 4 6 6.8


The number 2, which is part of the data, is the first quartile. One-fourth of
the values are the same as or less than 2 and three-fourths of the values are
more than 2.


The upper half of the data is 7.2, 8, 8.3, 9, 10, 10, 11.5. The middle value of 
the upper half is 9. 


7.2 8 8.3 9 10 10 11.5


The number 9, which is part of the data, is the third quartile. Three-fourths 
of the values are less than 9 and one-fourth of the values are more than 9. 
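The median-of-halves procedure can be sketched in Python as follows. The function name `quartiles` is ours. One assumption is made explicit in the code: when the number of values is odd, the median itself is left out of both halves. Statistical software may use other quartile conventions, so results can differ slightly from this hand method.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method.

    Assumption: with an odd count, the middle value is excluded from
    both the lower half and the upper half. Needs at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # smallest half of the data
    upper = ordered[-half:]   # largest half of the data
    return median(lower), median(ordered), median(upper)


data = [1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5]
q1, q2, q3 = quartiles(data)
print("Q1 =", q1, " median =", q2, " Q3 =", q3)
```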


To construct a box plot, use a horizontal number line and a rectangular box. 
The smallest and largest data values label the endpoints of the axis. The first 
quartile marks one end of the box and the third quartile marks the other end 
of the box. The middle fifty percent of the data fall inside the box. The 
"whiskers" extend from the ends of the box to the smallest and largest data 
values. The box plot gives a good quick picture of the data. 


Note: You may encounter box and whisker plots that have dots marking 
outlier values. In those cases, the whiskers are not extending to the 
minimum and maximum values. 


Consider the following data: 


1 1 2 2 4 6 6.8 7.2 8 8.3 9 10 10 11.5


The first quartile is 2, the median is 7, and the third quartile is 9. The 
smallest value is 1 and the largest value is 11.5. The box plot is constructed 
as follows (see calculator instructions in the back of this book or on the TI 
web site): 


[Figure: box plot on a number line from 1 to 11.5, with the box extending
from 2 to 9 and the median marked at 7.]


The two whiskers extend from the first quartile to the smallest value and 
from the third quartile to the largest value. The median is shown with a 
dashed line. 


Example: 

The following data are the heights of 40 students in a statistics class. 

59 60 61 62 62 63 63 64 64 64 65 65 65 65 65 65 65 65 65 66 66 67 67 68
68 69 70 70 70 70 70 71 71 72 72 73 74 74 75 77

Construct a box plot: 

Using the TI-83, 83+, 84, 84+ Calculator 


• Enter data into the list editor (Press STAT 1:EDIT). If you need to
clear the list, arrow up to the name L1, press CLEAR, arrow down.

• Put the data values in list L1.

• Press STAT and arrow to CALC. Press 1:1-VarStats. Enter L1.

• Press ENTER

• Use the down and up arrow keys to scroll.


• Smallest value = 59
• Largest value = 77
• Q1: First quartile = 64.5
• Q2: Second quartile or median = 66
• Q3: Third quartile = 70


Using the TI-83, 83+, 84, 84+ to Construct the Box Plot 
Go to 14:Appendix for Notes for the TI-83, 83+, 84, 84+ Calculator. To 
create the box plot: 


• Press Y=. If there are any equations, press CLEAR to clear them.

• Press 2nd Y=.

• Press 4:Plotsoff. Press ENTER

• Press 2nd Y=

• Press 1:Plot1. Press ENTER.

• Arrow down and then use the right arrow key to go to the 5th picture
which is the box plot. Press ENTER.

• Arrow down to Xlist: Press 2nd 1 for L1

• Arrow down to Freq: Press ALPHA. Press 1.

• Press ZOOM. Press 9:ZoomStat.

• Press TRACE and use the arrow keys to examine the box plot.


[Figure: box plot of the heights, marked at 59, 64.5, 66, 70, and 77.]


a. Each quarter has 25% of the data.

b. The spreads of the four quarters are 64.5 - 59 = 5.5 (first quarter), 66
- 64.5 = 1.5 (second quarter), 70 - 66 = 4 (third quarter), and 77 - 70 =
7 (fourth quarter). So, the second quarter has the smallest spread and
the fourth quarter has the largest spread.

c. Interquartile Range: IQR = Q3 - Q1 = 70 - 64.5 = 5.5.

d. The interval 59 through 65 has more than 25% of the data so it has
more data in it than the interval 66 through 70 which has 25% of the
data.

e. The middle 50% (middle half) of the data has a range of 5.5 inches.
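For readers without a TI calculator, the five values behind this box plot and the quarter spreads in item b can be reproduced with a short Python sketch. It uses the median-of-halves method for the quartiles; the dictionary keys and variable names are our own.

```python
from statistics import median

heights = [59, 60, 61, 62, 62, 63, 63, 64, 64, 64,
           65, 65, 65, 65, 65, 65, 65, 65, 65, 66,
           66, 67, 67, 68, 68, 69, 70, 70, 70, 70,
           70, 71, 71, 72, 72, 73, 74, 74, 75, 77]

ordered = sorted(heights)
half = len(ordered) // 2

# The five values from which the box plot is drawn.
summary = {
    "smallest": ordered[0],
    "Q1": median(ordered[:half]),
    "median": median(ordered),
    "Q3": median(ordered[-half:]),
    "largest": ordered[-1],
}

# Spread of each quarter: the distance between neighboring summary values.
points = list(summary.values())
spreads = [high - low for low, high in zip(points, points[1:])]
iqr = summary["Q3"] - summary["Q1"]

print(summary)
print("quarter spreads:", spreads)
print("IQR:", iqr)
```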


For some sets of data, some of the largest value, smallest value, first 
quartile, median, and third quartile may be the same. For instance, you 
might have a data set in which the median and the third quartile are the 
same. In this case, the diagram would not have a dotted line inside the box 
displaying the median. The right side of the box would display both the 
third quartile and the median. For example, if the smallest value and the 
first quartile were both 1, the median and the third quartile were both 5, 
and the largest value was 7, the box plot would look as follows: 


Example: 

Test scores for a college statistics class held during the day are: 

99 56 78 55.5 32 90 80 81 56 59 45 77 84.5 84 70 72 68 32 79 90 

Test scores for a college statistics class held during the evening are: 

98 78 68 83 81 89 88 76 65 45 98 90 80 84.5 85 79 78 98 90 79 81 25.5 
Exercise: 


Problem: 


• What are the smallest and largest data values for each data set?

• What is the median, the first quartile, and the third quartile for
each data set?

• Create a boxplot for each set of data.

• Which boxplot has the widest spread for the middle 50% of the
data (the data between the first and third quartiles)? What does
this mean for that set of data in comparison to the other set of
data?

• For each data set, what percent of the data is between the
smallest value and the first quartile? (Answer: 25%) the first
quartile and the median? (Answer: 25%) the median and the third
quartile? the third quartile and the largest value? What percent of
the data is between the first quartile and the largest value?
(Answer: 75%)


Solution: 
First Data Set 


• Xmin = 32
• Q1 = 56
• M = 74.5
• Q3 = 82.5
• Xmax = 99


Second Data Set

• Xmin = 25.5
• Q1 = 78
• M = 81
• Q3 = 89
• Xmax = 98


[Figure: box plots of the two data sets drawn above a common number line
from 20 to 100.]


The first data set (the top box plot) has the widest spread for the middle
50% of the data. IQR = Q3 - Q1 is 82.5 - 56 = 26.5 for the first data
set and 89 - 78 = 11 for the second data set. So, the first set of data has
its middle 50% of scores more spread out.

25% of the data is between M and Q3 and 25% is between Q3 and Xmax. 


Glossary 


Median 
A number that separates ordered data into halves. Half the values are 
the same number or smaller than the median and half the values are the 
same number or larger than the median. The median may or may not 
be part of the data. 


Quartiles 
The numbers that separate the data into quarters. Quartiles may or may 
not be part of the data. The second quartile is the median of the data. 


Descriptive Statistics: Measuring the Location of the Data 

Descriptive Statistics: Measuring the Location of Data explains percentiles and 
quartiles and is part of the collection col10555 written by Barbara Illowsky and 
Susan Dean. Roberta Bloom contributed the section "Interpreting Percentiles, 
Quartile and the Median." 


The common measures of location are quartiles and percentiles (%iles). Quartiles
are special percentiles. The first quartile, Q1, is the same as the 25th percentile
(25th %ile) and the third quartile, Q3, is the same as the 75th percentile (75th
%ile). The median, M, is called both the second quartile and the 50th percentile
(50th %ile).


Note: Quartiles are given special attention in the Box Plots module in this chapter. 


To calculate quartiles and percentiles, the data must be ordered from smallest to 
largest. Recall that quartiles divide ordered data into quarters. Percentiles divide 
ordered data into hundredths. To score in the 90th percentile of an exam does not 
mean, necessarily, that you received 90% on a test. It means that 90% of test 
scores are the same or less than your score and 10% of the test scores are the same 
or greater than your test score. 


Percentiles are useful for comparing values. For this reason, universities and 
colleges use percentiles extensively. 


Percentiles are mostly used with very large populations. Therefore, if you were to 
say that 90% of the test scores are less (and not the same or less) than your score, 
it would be acceptable because removing one particular data value is not 
significant. 


The interquartile range is a number that indicates the spread of the middle half
or the middle 50% of the data. It is the difference between the third quartile (Q3)
and the first quartile (Q1).

Equation:


$IQR = Q_3 - Q_1$


The IQR can help to determine potential outliers. A value is suspected to be a
potential outlier if it is more than (1.5)(IQR) below the first quartile or more
than (1.5)(IQR) above the third quartile. Potential outliers always need further
investigation.


Example: 
Exercise: 


Problem: 

For the following 13 real estate prices, calculate the IQR and determine if 
any prices are outliers. Prices are in dollars. (Source: San Jose Mercury 
News) 


389,950 230,500 158,000 479,000 639,000 114,950 5,500,000 387,000 
659,000 529,000 575,000 488,800 1,095,000 


Solution: 


Order the data from smallest to largest. 


114,950 158,000 230,500 387,000 389,950 479,000 488,800 529,000 
575,000 639,000 659,000 1,095,000 5,500,000


M = 488,800

$Q_1 = \frac{230500 + 387000}{2} = 308750$

$Q_3 = \frac{639000 + 659000}{2} = 649000$

$IQR = 649000 - 308750 = 340250$

$(1.5)(IQR) = (1.5)(340250) = 510375$

$Q_1 - (1.5)(IQR) = 308750 - 510375 = -201625$
$Q_3 + (1.5)(IQR) = 649000 + 510375 = 1159375$


No house price is less than -201625. However, 5,500,000 is more than 
1,159,375. Therefore, 5,500,000 is a potential outlier. 
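The outlier check in this solution can be sketched in Python. The names `lower_fence` and `upper_fence` are our labels for Q1 - (1.5)(IQR) and Q3 + (1.5)(IQR); the text does not use those terms. The quartiles come from the median-of-halves method, with the median left out of both halves because the count (13) is odd.

```python
from statistics import median

prices = [389950, 230500, 158000, 479000, 639000, 114950, 5500000,
          387000, 659000, 529000, 575000, 488800, 1095000]

ordered = sorted(prices)
half = len(ordered) // 2          # 6 values on each side of the median

q1 = median(ordered[:half])       # middle of the lower half
q3 = median(ordered[-half:])      # middle of the upper half
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

potential_outliers = [p for p in ordered
                      if p < lower_fence or p > upper_fence]

print("Q1 =", q1, " Q3 =", q3, " IQR =", iqr)
print("fences:", lower_fence, "to", upper_fence)
print("potential outliers:", potential_outliers)
```

A value flagged this way is only a candidate. As the text notes, potential outliers always need further investigation.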


Example: 
Exercise: 


Problem: 
For the two data sets in the test scores example, find the following: 


a. The interquartile range. Compare the two interquartile ranges.

b. Any outliers in either set.

c. The 30th percentile and the 80th percentile for each set. How much
data falls below the 30th percentile? Above the 80th percentile?


Solution: 


For the IQRs, see the answer to the test scores example. The first data set has 
the larger IQR, so the scores between Q3 and Q1 (middle 50%) for the first 
data set are more spread out and not clustered about the median. 


First Data Set 


• $\frac{3}{2} \cdot IQR = \frac{3}{2} \cdot 26.5 = 39.75$
• Xmax - Q3 = 99 - 82.5 = 16.5
• Q1 - Xmin = 56 - 32 = 24


$\frac{3}{2} \cdot IQR = 39.75$ is larger than 16.5 and larger than 24, so the first set
has no outliers.


Second Data Set

• $\frac{3}{2} \cdot IQR = \frac{3}{2} \cdot 11 = 16.5$
• Xmax - Q3 = 98 - 89 = 9
• Q1 - Xmin = 78 - 25.5 = 52.5


$\frac{3}{2} \cdot IQR = 16.5$ is larger than 9 but smaller than 52.5, so for the
second set 45 and 25.5 are outliers.


To find the percentiles, create a frequency, relative frequency, and
cumulative relative frequency chart (see the Sampling and Data chapter).
Get the percentiles from that chart.
First Data Set 


• 30th %ile (between the 6th and 7th values) = $\frac{56 + 59}{2} = 57.5$
• 80th %ile (between the 16th and 17th values) = $\frac{84 + 84.5}{2} = 84.25$

Second Data Set

• 30th %ile (7th value) = 78
• 80th %ile (18th value) = 90


30% of the data falls below the 30th %ile, and 20% falls above the 80th 
%ile. 
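The percentile answers above follow a counting rule that can be sketched in Python. The rule is inferred from the worked answers rather than stated in the text: compute p% of n; if the result is a whole number k, average the kth and (k+1)th ordered values; otherwise round up and take that value. The function name `percentile` is ours, and statistical software often uses other interpolation conventions, so its results may differ.

```python
def percentile(values, p):
    """Return the pth percentile (p a whole number, 0 < p < 100).

    Rule assumed here: if p% of n is a whole number k, average the kth
    and (k+1)th ordered values; otherwise round the position up.
    """
    ordered = sorted(values)
    n = len(ordered)
    # Integer arithmetic avoids floating-point surprises in p * n / 100.
    whole, remainder = divmod(p * n, 100)
    if remainder == 0:
        return (ordered[whole - 1] + ordered[whole]) / 2
    return ordered[whole]  # 0-based index of the rounded-up position


day = [99, 56, 78, 55.5, 32, 90, 80, 81, 56, 59,
       45, 77, 84.5, 84, 70, 72, 68, 32, 79, 90]
evening = [98, 78, 68, 83, 81, 89, 88, 76, 65, 45, 98,
           90, 80, 84.5, 85, 79, 78, 98, 90, 79, 81, 25.5]

for name, scores in (("day", day), ("evening", evening)):
    print(name, "30th %ile:", percentile(scores, 30),
          " 80th %ile:", percentile(scores, 80))
```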


Example: 

Finding Quartiles and Percentiles Using a Table 

Fifty statistics students were asked how much sleep they get per school night 
(rounded to the nearest hour). The results were (student data): 


AMOUNT OF SLEEP                                    CUMULATIVE
PER SCHOOL                          RELATIVE       RELATIVE
NIGHT (HOURS)       FREQUENCY       FREQUENCY      FREQUENCY
4                   2               0.04           0.04
5                   5               0.10           0.14
6                   7               0.14           0.28
7                   12              0.24           0.52
8                   14              0.28           0.80
9                   7               0.14           0.94
10                  3               0.06           1.00


Find the 28th percentile: Notice the 0.28 in the "cumulative relative frequency" 
column. 28% of 50 data values = 14. There are 14 values less than the 28th %ile. 
They include the two 4s, the five 5s, and the seven 6s. The 28th %ile is between 
the last 6 and the first 7. The 28th %ile is 6.5. 

Find the median: Look again at the "cumulative relative frequency" column and
find 0.52. The median is the 50th %ile or the second quartile. 50% of 50 = 25. 
There are 25 values less than the median. They include the two 4s, the five 5s, the 
seven 6s, and eleven of the 7s. The median or 50th %ile is between the 25th (7) 
and 26th (7) values. The median is 7. 

Find the third quartile: The third quartile is the same as the 75th percentile. You 
can "eyeball" this answer. If you look at the "cumulative relative frequency" 
column, you find 0.52 and 0.80. When you have all the 4s, 5s, 6s and 7s, you 
have 52% of the data. When you include all the 8s, you have 80% of the data. 
The 75th %ile, then, must be an 8. Another way to look at the problem is to
find 75% of 50 (= 37.5) and round up to 38. The third quartile, Q3, is the 38th
value which is an 8. You can check this answer by counting the values. (There are 
37 values below the third quartile and 12 values above.) 
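The table lookups in this example can be sketched in Python. The code builds the cumulative frequency column and then applies the same counting rule as the worked answers: if p% of n is a whole number, average that value and the next one; otherwise round the position up. The helper names `value_at` and `percentile_from_table` are hypothetical, and the rule is our reading of the example rather than a formula given in the text.

```python
from itertools import accumulate

hours = [4, 5, 6, 7, 8, 9, 10]
frequency = [2, 5, 7, 12, 14, 7, 3]

n = sum(frequency)
cumulative = list(accumulate(frequency))
cumulative_relative = [c / n for c in cumulative]


def value_at(position):
    """Return the data value at a 1-based position in the ordered data."""
    for value, running_total in zip(hours, cumulative):
        if position <= running_total:
            return value
    raise ValueError("position is beyond the last data value")


def percentile_from_table(p):
    """Return the pth percentile (p a whole number, 0 < p < 100)."""
    whole, remainder = divmod(p * n, 100)
    if remainder == 0:
        return (value_at(whole) + value_at(whole + 1)) / 2
    return value_at(whole + 1)


print([round(r, 2) for r in cumulative_relative])
for p in (28, 50, 75):
    print(f"{p}th %ile:", percentile_from_table(p))
```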


Example: 


Exercise: 


Problem: Using the table: 


1. Find the 80th percentile. 

2. Find the 90th percentile. 

3. Find the first quartile. 

4. What is another name for the first quartile? 


Solution: 


1. $\frac{8 + 9}{2} = 8.5$
Look where cum. rel. freq. = 0.80. 80% of the data is 8 or less. 80th
%ile is between the last 8 and first 9.
2. 9
3. 6
4. First Quartile = 25th %ile


Collaborative Classroom Exercise: Your instructor or a member of the class will 
ask everyone in class how many sweaters they own. Answer the following 
questions. 


1. How many students were surveyed? 

2. What kind of sampling did you do? 

3. Construct a table of the data. 

4. Construct 2 different histograms. For each, starting value = _______
ending value = _______

5. Use the table to find the median, first quartile, and third quartile. 

6. Construct a box plot. 

7. Use the table to find the following: 


o The 10th percentile 
o The 70th percentile 
o The percent of students who own less than 4 sweaters 


Interpreting Percentiles, Quartiles, and Median 


A percentile indicates the relative standing of a data value when data are sorted 
into numerical order, from smallest to largest. p% of data values are less than or 
equal to the pth percentile. For example, 15% of data values are less than or equal 
to the 15th percentile. 


• Low percentiles always correspond to lower data values.
• High percentiles always correspond to higher data values.


A percentile may or may not correspond to a value judgment about whether it is 
"good" or "bad". The interpretation of whether a certain percentile is good or bad 
depends on the context of the situation to which the data applies. In some 
situations, a low percentile would be considered "good"; in other contexts a high
percentile might be considered "good". In many situations, there is no value
judgment that applies. 


Understanding how to interpret percentiles properly is important not only when
describing data but also in later chapters of this textbook, when
calculating probabilities.


Guideline: 


When writing the interpretation of a percentile in the context of the given data, the 
sentence should contain the following information: 


• information about the context of the situation being considered,

• the data value (value of the variable) that represents the percentile,

• the percent of individuals or items with data values below the percentile.

• Additionally, you may also choose to state the percent of individuals or items
with data values above the percentile.


Example: 
On a timed math test, the first quartile for times for finishing the exam was 35 
minutes. Interpret the first quartile in the context of this situation. 


• 25% of students finished the exam in 35 minutes or less.

• 75% of students finished the exam in 35 minutes or more.

• A low percentile could be considered good, as finishing more quickly on a
timed exam is desirable. (If you take too long, you might not be able to
finish.)


Example: 
On a 20 question math test, the 70th percentile for number of correct answers was 
16. Interpret the 70th percentile in the context of this situation. 


• 70% of students answered 16 or fewer questions correctly.

• 30% of students answered 16 or more questions correctly.

• Note: A high percentile could be considered good, as answering more
questions correctly is desirable.


Example: 

At a certain community college, it was found that the 30th percentile of credit 
units that students are enrolled for is 7 units. Interpret the 30th percentile in the 
context of this situation. 


• 30% of students are enrolled in 7 or fewer credit units.

• 70% of students are enrolled in 7 or more credit units.

• In this example, there is no "good" or "bad" value judgment associated with
a higher or lower percentile. Students attend community college for varied
reasons and needs, and their course load varies according to their needs.
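
The interpretation sentences in the examples above can be checked against raw data. The following is only a minimal sketch: the finish times are hypothetical (they do not come from the text), and `percent_at_or_below` is a helper name introduced here for illustration.

```python
# Hypothetical finish times in minutes (illustrative data, not from the text).
times = [28, 31, 33, 35, 36, 38, 40, 41, 43, 45, 47, 50]


def percent_at_or_below(data, value):
    """Return the percent of data values that are less than or equal to value."""
    count = sum(1 for x in data if x <= value)
    return 100 * count / len(data)


share = percent_at_or_below(times, 33)
print(f"{share:.0f}% of these students finished in 33 minutes or less.")
```

With these made-up times, 3 of the 12 values are 33 or less, so 33 minutes plays the role of the first quartile in the interpretation sentence.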


Do the following Practice Problems for Interpreting Percentiles 
Exercise: 


Problem: 


• a. For runners in a race, a low time means a faster run. The winners in a
race have the shortest running times. Is it more desirable to have a finish
time with a high or a low percentile when running a race?

• b. The 20th percentile of run times in a particular race is 5.2 minutes.
Write a sentence interpreting the 20th percentile in the context of the
situation.

• c. A bicyclist in the 90th percentile of a bicycle race between two towns
completed the race in 1 hour and 12 minutes. Is he among the fastest or
slowest cyclists in the race? Write a sentence interpreting the 90th
percentile in the context of the situation.


Solution: 


• a. For runners in a race, it is more desirable to have a low percentile for
finish time. A low percentile means a short time, which is faster.

• b. INTERPRETATION: 20% of runners finished the race in 5.2 minutes
or less. 80% of runners finished the race in 5.2 minutes or longer.

• c. He is among the slowest cyclists (90% of cyclists were faster than
him). INTERPRETATION: 90% of cyclists had a finish time of 1 hour,
12 minutes or less. Only 10% of cyclists had a finish time of 1 hour, 12
minutes or longer.


Exercise: 


Problem: 


• a. For runners in a race, a higher speed means a faster run. Is it more
desirable to have a speed with a high or a low percentile when running a
race?

• b. The 40th percentile of speeds in a particular race is 7.5 miles per hour.
Write a sentence interpreting the 40th percentile in the context of the
situation.


Solution: 


• a. For runners in a race, it is more desirable to have a high percentile for
speed. A high percentile means a higher speed, which is faster.

• b. INTERPRETATION: 40% of runners ran at speeds of 7.5 miles per
hour or less (slower). 60% of runners ran at speeds of 7.5 miles per hour
or more (faster).


Exercise: 


Problem: 


On an exam, would it be more desirable to earn a grade with a high or low 
percentile? Explain. 


Solution: 
On an exam you would prefer a high percentile; higher percentiles 
correspond to higher grades on the exam. 

Exercise: 
Problem: 
Mina is waiting in line at the Department of Motor Vehicles (DMV). Her wait 
time of 32 minutes is the 85th percentile of wait times. Is that good or bad? 


Write a sentence interpreting the 85th percentile in the context of this 
situation. 


Solution: 


When waiting in line at the DMV, the 85th percentile would be a long wait 
time compared to the other people waiting. 85% of people had shorter wait 
times than you did. In this context, you would prefer a wait time 
corresponding to a lower percentile. INTERPRETATION: 85% of people at 
the DMV waited 32 minutes or less. 15% of people at the DMV waited 32 
minutes or longer. 


Exercise: 
Problem: 
In a survey collecting data about the salaries earned by recent college
graduates, Li found that her salary was in the 78th percentile. Should Li be
pleased or upset by this result? Explain.


Solution: 


Li should be pleased. Her salary is relatively high compared to other recent 
college grads. 78% of recent college graduates earn less than Li does. 22% of 
recent college graduates earn more than Li does. 


Exercise: 


Problem: 


In a study collecting data about the repair costs of damage to automobiles in a 
certain type of crash tests, a certain model of car had $1700 in damage and 
was in the 90th percentile. Should the manufacturer and/or a consumer be 
pleased or upset by this result? Explain. Write a sentence that interprets the 
90th percentile in the context of this problem. 


Solution: 


The manufacturer and the consumer would be upset. This is a large repair 
cost for the damages, compared to the other cars in the sample. 
INTERPRETATION: 90% of the crash tested cars had damage repair costs of 
$1700 or less; only 10% had damage repair costs of $1700 or more. 


Exercise: 


Problem: 


• The University of California has two criteria used to set admission
standards for freshmen to be admitted to a college in the UC system:

• a. Students' GPAs and scores on standardized tests (SATs and ACTs) are
entered into a formula that calculates an "admissions index" score. The
admissions index score is used to set eligibility standards intended to
meet the goal of admitting the top 12% of high school students in the
state. In this context, what percentile does the top 12% represent?

• b. Students whose GPAs are at or above the 96th percentile of all
students at their high school are eligible (called eligible in the local
context), even if they are not in the top 12% of all students in the state.
What percent of students from each high school are "eligible in the local
context"?


Solution: 
• a. The top 12% of students are those who are at or above the 88th
percentile of admissions index scores.


• b. The top 4% of students' GPAs are at or above the 96th percentile,
making the top 4% of students "eligible in the local context".


Exercise: 


Problem: 


Suppose that you are buying a house. You and your realtor have determined 
that the most expensive house you can afford is the 34th percentile. The 34th 
percentile of housing prices is $240,000 in the town you want to move to. In 
this town, can you afford 34% of the houses or 66% of the houses? 


Solution: 


You can afford 34% of houses. 66% of the houses are too expensive for your 
budget. INTERPRETATION: 34% of houses cost $240,000 or less. 66% of 
houses cost $240,000 or more. 


**With contributions from Roberta Bloom 


Glossary 


Interquartile Range (IQR)
The distance between the third quartile (Q3) and the first quartile (Q1). IQR
= Q3 - Q1.


Outlier 
An observation that does not fit the rest of the data. 


Percentile 
A number that divides ordered data into hundredths. 


Example: 
Let a data set contain 200 ordered observations starting with
{2.3, 2.7, 2.8, 2.9, 2.9, 3.0, ...}. Then the first percentile is
$\frac{2.7 + 2.8}{2} = 2.75$, because 1% of the data is to the left of this point
on the number line and 99% of the data is on its right. The second percentile is
$\frac{2.9 + 2.9}{2} = 2.9$. Percentiles may or may not be part of the data. In this
example, the first percentile is not in the data, but the second percentile is.
The median of the data is the second quartile and the 50th percentile. The first
and third quartiles are the 25th and the 75th percentiles, respectively.


Quartiles 
The numbers that separate the data into quarters. Quartiles may or may not be 
part of the data. The second quartile is the median of the data. 


Descriptive Statistics: Measuring the Center of the Data 
This chapter discusses measuring descriptive statistical information using the 
center of the data 


The "center" of a data set is also a way of describing location. The two most 
widely used measures of the "center" of the data are the mean (average) and the 
median. To calculate the mean weight of 50 people, add the 50 weights 
together and divide by 50. To find the median weight of the 50 people, order 
the data and find the number that splits the data into two equal parts (previously 
discussed under box plots in this chapter). The median is generally a better 
measure of the center when there are extreme values or outliers because it is not 
affected by the precise numerical values of the outliers. The mean is the most 
common measure of the center. 


Note: The words "mean" and "average" are often used interchangeably. The
substitution of one word for the other is common practice. The technical term
is "arithmetic mean" and "average" is technically a center location. However,
in practice among non-statisticians, "average" is commonly accepted for
"arithmetic mean."


The mean can also be calculated by multiplying each distinct value by its
frequency and then dividing the sum by the total number of data values. The
letter used to represent the sample mean is an x with a bar over it, $\bar{x}$
(pronounced "x bar").


The Greek letter $\mu$ (pronounced "mew") represents the population mean. One
of the requirements for the sample mean to be a good estimate of the population
mean is for the sample taken to be truly random.


To see that both ways of calculating the mean are the same, consider the 
sample: 


1 1 1 2 2 3 4 4 4 4 4
Equation:

$$\bar{x} = \frac{1+1+1+2+2+3+4+4+4+4+4}{11} = 2.7$$

Equation:

$$\bar{x} = \frac{3 \times 1 + 2 \times 2 + 1 \times 3 + 5 \times 4}{11} = 2.7$$

In the second calculation for the sample mean, the frequencies are 3, 2, 1, and
5.
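
As a quick check that the two calculations agree, here is a minimal Python sketch using the sample above. The variable names are chosen for illustration, and `collections.Counter` is used only as a convenient way to tally the frequencies.

```python
from collections import Counter

sample = [1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4]

# Way 1: add every value and divide by the number of values.
mean_direct = sum(sample) / len(sample)

# Way 2: multiply each distinct value by its frequency, then divide by n.
frequencies = Counter(sample)  # tallies how often each distinct value occurs
mean_from_freq = sum(value * f for value, f in frequencies.items()) / len(sample)

print(round(mean_direct, 1), round(mean_from_freq, 1))
```

Both sums equal 30, so both ways give the same mean, which rounds to 2.7.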

You can quickly find the location of the median by using the expression $\frac{n+1}{2}$.
The letter n is the total number of data values in the sample. If n is an odd
number, the median is the middle value of the ordered data (ordered smallest to
largest). If n is an even number, the median is equal to the two middle values
added together and divided by 2 after the data has been ordered. For example, if
the total number of data values is 97, then $\frac{n+1}{2} = \frac{97+1}{2} = 49$. The median is the
49th value in the ordered data. If the total number of data values is 100, then
$\frac{n+1}{2} = \frac{100+1}{2} = 50.5$. The median occurs midway between the 50th and 51st
values. The location of the median and the value of the median are not the
same. The upper case letter M is often used to represent the median. The next
example illustrates the location of the median and the value of the median.
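
The distinction between the location and the value of the median can be sketched in a few lines of Python. The function names `median_location` and `median_value` are introduced here for illustration only; the data passed to `median_value` is assumed to be sorted already.

```python
def median_location(n):
    """Position (counting from 1) of the median in ordered data of size n."""
    return (n + 1) / 2


def median_value(ordered):
    """Median of data that is already ordered from smallest to largest."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]  # odd n: the single middle value
    # even n: add the two middle values and divide by 2
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median_location(97))   # location for n = 97
print(median_location(100))  # location for n = 100
```

A location such as 50.5 is not a data value; it says that the median lies midway between the 50th and 51st ordered values.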


Example: 
Exercise: 


Problem: 


AIDS data indicating the number of months an AIDS patient lives after 
taking a new antibody drug are as follows (smallest to largest): 


3 4 8 8 10 11 12 13 14 15 15 16 16 17 17 18 21 22 22 24 24 25 26 26 27 27 29 29 31 32 33
33 34 34 35 37 40 44 44 47


Calculate the mean and the median. 


Solution: 


The calculation for the mean is: 


$$\bar{x} = \frac{3+4+(8)(2)+10+11+12+13+14+(15)(2)+(16)(2)+\dots+35+37+40+(44)(2)+47}{40} = 23.6$$


To find the median, M, first use the formula for the location. The location 
is: 


$$\frac{n+1}{2} = \frac{40+1}{2} = 20.5$$


Starting at the smallest value, the median is located between the 20th and 
21st values (the two 24s): 


3 4 8 8 10 11 12 13 14 15 15 16 16 17 17 18 21 22 22 24
24 25 26 26 27 27 29 29 31 32 33 33 34 34 35 37 40 44 44 47


$$M = \frac{24+24}{2} = 24$$


The median is 24. 
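
For readers working on a computer rather than a calculator, the same results can be reproduced with Python's standard `statistics` module. This is a sketch only; the variable names are chosen for illustration.

```python
import statistics

# Months survived, ordered smallest to largest (the 40 values from the example).
months = [3, 4, 8, 8, 10, 11, 12, 13, 14, 15, 15, 16, 16, 17, 17, 18, 21, 22, 22, 24,
          24, 25, 26, 26, 27, 27, 29, 29, 31, 32, 33, 33, 34, 34, 35, 37, 40, 44, 44, 47]

n = len(months)
location = (n + 1) / 2            # location of the median
mean = statistics.mean(months)    # sum of the values divided by n
median = statistics.median(months)

print(n, location, round(mean, 1), median)
```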


Using the TI-83,83+,84, 84+ Calculators 
Calculator Instructions are located in the menu item 14:Appendix (Notes for 
the TI-83, 83+, 84, 84+ Calculators). 


• Enter data into the list editor. Press STAT 1:EDIT

• Put the data values in list L1.

• Press STAT and arrow to CALC. Press 1:1-VarStats. Press 2nd 1 for L1
and ENTER.

• Press the down and up arrow keys to scroll.


$\bar{x} = 23.6$, $M = 24$


Example: 
Exercise: 


Problem: 
Suppose that, in a small town of 50 people, one person earns $5,000,000
per year and the other 49 each earn $30,000. Which is the better measure
of the "center," the mean or the median?


Solution: 


$$\bar{x} = \frac{5000000 + 49 \times 30000}{50} = 129400$$
$M = 30000$


(There are 49 people who earn $30,000 and one person who earns 
$5,000,000.) 


The median is a better measure of the "center" than the mean because 49 
of the values are 30,000 and one is 5,000,000. The 5,000,000 is an outlier. 
The 30,000 gives us a better sense of the middle of the data. 
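
The effect of the single outlier can be seen directly by computing both measures. This is a minimal sketch with Python's `statistics` module; the variable names are illustrative.

```python
import statistics

# 49 people earn $30,000 and one person earns $5,000,000.
incomes = [30_000] * 49 + [5_000_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(mean_income, median_income)
```

The mean is pulled far above what 49 of the 50 people earn, while the median stays at 30,000.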


Another measure of the center is the mode. The mode is the most frequent
value. If two values in a data set are tied for the highest frequency, then the
set is bimodal.


Example: 

Statistics exam scores for 20 students are as follows:
50 53 59 59 63 63 72 72 72 72 72 76 78 81 83 84 84 84 90 93
Exercise:


Problem: Find the mode.
Solution: 


The most frequent score is 72, which occurs five times. Mode = 72. 


Example: 
Five real estate exam scores are 430, 430, 480, 480, 495. The data set is 
bimodal because the scores 430 and 480 each occur twice. 


When is the mode the best measure of the "center"? Consider a weight loss 
program that advertises a mean weight loss of six pounds the first week of the 
program. The mode might indicate that most people lose two pounds the first 
week, making the program less appealing. 


Note: The mode can be calculated for qualitative data as well as for
quantitative data.
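
The bimodal example and the note about qualitative data can be sketched with `statistics.multimode`, which returns every value tied for the highest frequency. The list of colors is hypothetical and is included only to show that the mode does not require numbers.

```python
import statistics

# Five real estate exam scores from the example above.
real_estate_scores = [430, 430, 480, 480, 495]
modes = statistics.multimode(real_estate_scores)  # all values tied for most frequent

# Hypothetical qualitative data (not from the text).
colors = ["red", "blue", "blue", "green"]
color_modes = statistics.multimode(colors)

print(modes, color_modes)
```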


Statistical software will easily calculate the mean, the median, and the mode. 
Some graphing calculators can also make these calculations. In the real world, 
people make these calculations using software. 


The Law of Large Numbers and the Mean 


The Law of Large Numbers says that if you take samples of larger and larger
size from any population, then the mean $\bar{x}$ of the sample is very likely to get
closer and closer to $\mu$. This is discussed in more detail in The Central Limit
Theorem.
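
A small simulation can illustrate the idea. The sketch below assumes a population not mentioned in the text: rolls of a fair six-sided die, for which the population mean is 3.5. The fixed seed only makes the run repeatable.

```python
import random

rng = random.Random(0)  # fixed seed so that the run is repeatable


def sample_mean(n):
    """Mean of n simulated rolls of a fair six-sided die (population mean 3.5)."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n


# Sample means for larger and larger sample sizes.
for n in (10, 1_000, 100_000):
    print(n, sample_mean(n))
```

Individual runs vary, but the sample means for the larger sample sizes tend to sit closer to 3.5.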


Note: The formula for the mean is located in the Summary of Formulas section.


Sampling Distributions and Statistic of a Sampling Distribution 


You can think of a sampling distribution as a relative frequency distribution 
with a great many samples. (See Sampling and Data for a review of relative 
frequency). Suppose thirty randomly selected students were asked the number 
of movies they watched the previous week. The results are in the relative 
frequency table shown below. 


# of movies    Relative Frequency
0              5/30
1              15/30
2              6/30
3              4/30
4              1/30


If you let the number of samples get very large (say, 300 million or more), 
the relative frequency table becomes a relative frequency distribution. 


A statistic is a number calculated from a sample. Statistic examples include the
mean, the median and the mode as well as others. The sample mean $\bar{x}$ is an
example of a statistic which estimates the population mean $\mu$.


Glossary 


Mean 


A number that measures the central tendency. A common name for mean
is 'average.' The term 'mean' is a shortened form of 'arithmetic mean.' By
definition, the mean for a sample (denoted by $\bar{x}$) is
$\bar{x} = \frac{\text{Sum of all values in the sample}}{\text{Number of values in the sample}}$,
and the mean for a population (denoted by $\mu$) is
$\mu = \frac{\text{Sum of all values in the population}}{\text{Number of values in the population}}$.


Median 


A number that separates ordered data into halves. Half the values are the 
same number or smaller than the median and half the values are the same 
number or larger than the median. The median may or may not be part of 
the data. 


Mode 


The value that appears most frequently in a set of data. 


Descriptive Statistics: Skewness and the Mean, Median, and Mode 
Consider the following data set: 
4 5 6 6 6 7 7 7 7 7 7 8 8 8 9 10


This data set produces the histogram shown below. Each interval has width 
one and each value is located in the middle of an interval. 


[Histogram of the data; horizontal axis from 4 to 10.]


The histogram displays a symmetrical distribution of data. A distribution is 
symmetrical if a vertical line can be drawn at some point in the histogram 
such that the shape to the left and the right of the vertical line are mirror 
images of each other. The mean, the median, and the mode are each 7 for 
these data. In a perfectly symmetrical distribution, the mean and the 
median are the same. This example has one mode (unimodal) and the 
mode is the same as the mean and median. In a symmetrical distribution 
that has two modes (bimodal), the two modes would be different from the 
mean and median. 


The histogram for the data: 
4 5 6 6 6 7 7 7 7 8


is not symmetrical. The right-hand side seems "chopped off" compared to 
the left side. The shape distribution is called skewed to the left because it is 
pulled out to the left. 


[Histogram of the data; horizontal axis from 4 to 8.]


The mean is 6.3, the median is 6.5, and the mode is 7. Notice that the 
mean is less than the median and they are both less than the mode. The 
mean and the median both reflect the skewing but the mean more so. 


The histogram for the data: 
6 7 7 7 7 8 8 8 9 10


is also not symmetrical. It is skewed to the right. 


[Histogram of the data; horizontal axis from 6 to 10.]


The mean is 7.7, the median is 7.5, and the mode is 7. Of the three statistics, 
the mean is the largest, while the mode is the smallest. Again, the mean 
reflects the skewing the most. 


To summarize, generally if the distribution of data is skewed to the left, the 
mean is less than the median, which is often less than the mode. If the 
distribution of data is skewed to the right, the mode is often less than the 
median, which is less than the mean. 
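
The ordering of the three measures can be verified for the three data sets discussed in this module. The sketch below uses Python's `statistics` module; the dictionary keys are labels chosen here for illustration.

```python
import statistics

data_sets = {
    "symmetrical": [4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10],
    "skewed left": [4, 5, 6, 6, 6, 7, 7, 7, 7, 8],
    "skewed right": [6, 7, 7, 7, 7, 8, 8, 8, 9, 10],
}

# For each data set, collect (mean, median, mode).
summary = {
    name: (statistics.mean(d), statistics.median(d), statistics.mode(d))
    for name, d in data_sets.items()
}

for name, (mean, median, mode) in summary.items():
    print(f"{name}: mean = {mean}, median = {median}, mode = {mode}")
```

In the left-skewed set the mean is smaller than the median and the mode; in the right-skewed set the mean is the largest of the three.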


Skewness and symmetry become important when we discuss probability 
distributions in later chapters. 


Descriptive Statistics: Measuring the Spread of the Data 

Descriptive Statistics: Measuring the Spread of Data explains standard deviation as a measure of variation in data 
and is part of the collection col10555 written by Barbara Illowsky and Susan Dean. Roberta Bloom made 
contributions that helped to clarify the standard deviation and the variance. 


An important characteristic of any set of data is the variation in the data. In some data sets, the data values are 
concentrated closely near the mean; in other data sets, the data values are more widely spread out from the mean. 
The most common measure of variation, or spread, is the standard deviation. 


The standard deviation is a number that measures how far data values are from their mean. 
The standard deviation 


• provides a numerical measure of the overall amount of variation in a data set
• can be used to determine whether a particular data value is close to or far from the mean


The standard deviation provides a measure of the overall variation in a data set 

The standard deviation is always positive or 0. The standard deviation is small when the data are all concentrated 
close to the mean, exhibiting little variation or spread. The standard deviation is larger when the data values are 
more spread out from the mean, exhibiting more variation. 


Suppose that we are studying waiting times at the checkout line for customers at supermarket A and supermarket 
B; the average wait time at both markets is 5 minutes. At market A, the standard deviation for the waiting time is 2 
minutes; at market B the standard deviation for the waiting time is 4 minutes. 


Because market B has a higher standard deviation, we know that there is more variation in the waiting times at 
market B. Overall, wait times at market B are more spread out from the average; wait times at market A are more 
concentrated near the average. 


The standard deviation can be used to determine whether a data value is close to or far from the mean. 
Suppose that Rosa and Binh both shop at Market A. Rosa waits for 7 minutes and Binh waits for 1 minute at the 
checkout counter. At market A, the mean wait time is 5 minutes and the standard deviation is 2 minutes. The 
standard deviation can be used to determine whether a data value is close to or far from the mean. 


Rosa waits for 7 minutes: 


• 7 is 2 minutes longer than the average of 5; 2 minutes is equal to one standard deviation.
• Rosa's wait time of 7 minutes is 2 minutes longer than the average of 5 minutes.
• Rosa's wait time of 7 minutes is one standard deviation above the average of 5 minutes.


Binh waits for 1 minute. 


• 1 is 4 minutes less than the average of 5; 4 minutes is equal to two standard deviations.

• Binh's wait time of 1 minute is 4 minutes less than the average of 5 minutes.

• Binh's wait time of 1 minute is two standard deviations below the average of 5 minutes.

• A data value that is two standard deviations from the average is just on the borderline for what many
statisticians would consider to be far from the average. Considering data to be far from the mean if it is more
than 2 standard deviations away is more of an approximate "rule of thumb" than a rigid rule. In general, the
shape of the distribution of the data affects how much of the data is further away than 2 standard deviations.
(We will learn more about this in later chapters.)


The number line may help you understand standard deviation. If we were to put 5 and 7 on a number line, 7 is to 
the right of 5. We say, then, that 7 is one standard deviation to the right of 5 because 
5 + (1)(2) =7. 


If 1 were also part of the data set, then 1 is two standard deviations to the left of 5 because 
5 + (-2)(2) =1. 


[Number line from 0 to 7.]


• In general, a value = mean + (#ofSTDEVs)(standard deviation)

• where #ofSTDEVs = the number of standard deviations

• 7 is one standard deviation more than the mean of 5 because: 7 = 5 + (1)(2)
• 1 is two standard deviations less than the mean of 5 because: 1 = 5 + (-2)(2)
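
The relationship value = mean + (#ofSTDEVs)(standard deviation) can be used in either direction. The following sketch applies it to the wait times at market A (mean 5 minutes, standard deviation 2 minutes); the function names are introduced here for illustration only.

```python
def num_stdevs(value, mean, sd):
    """Number of standard deviations that value lies from the mean (negative = below)."""
    return (value - mean) / sd


def value_from_stdevs(mean, k, sd):
    """Value that lies k standard deviations from the mean."""
    return mean + k * sd


mean_wait, sd_wait = 5, 2  # market A: mean and standard deviation in minutes

print(num_stdevs(7, mean_wait, sd_wait))   # Rosa's wait of 7 minutes
print(num_stdevs(1, mean_wait, sd_wait))   # Binh's wait of 1 minute
print(value_from_stdevs(mean_wait, -2, sd_wait))
```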


The equation value = mean + (#ofSTDEVs)(standard deviation) can be expressed for a sample and for a 
population: 


• Sample: $x = \bar{x} + (\text{\#ofSTDEVs})(s)$
• Population: $x = \mu + (\text{\#ofSTDEVs})(\sigma)$


The lower case letter s represents the sample standard deviation and the Greek letter $\sigma$ (sigma, lower case)
represents the population standard deviation.


The symbol $\bar{x}$ is the sample mean and the Greek symbol $\mu$ is the population mean.


Calculating the Standard Deviation 

If x is a number, then the difference "x - mean" is called its deviation. In a data set, there are as many deviations
as there are items in the data set. The deviations are used to calculate the standard deviation. If the numbers belong
to a population, in symbols a deviation is $x - \mu$. For sample data, in symbols a deviation is $x - \bar{x}$.


The procedure to calculate the standard deviation depends on whether the numbers are the entire population or are 
data from a sample. The calculations are similar, but not identical. Therefore the symbol used to represent the 
standard deviation depends on whether it is calculated from a population or a sample. The lower case letter s
represents the sample standard deviation and the Greek letter $\sigma$ (sigma, lower case) represents the population
standard deviation. If the sample has the same characteristics as the population, then s should be a good estimate
of $\sigma$.


To calculate the standard deviation, we need to calculate the variance first. The variance is an average of the
squares of the deviations (the $x - \bar{x}$ values for a sample, or the $x - \mu$ values for a population). The symbol $\sigma^2$
represents the population variance; the population standard deviation $\sigma$ is the square root of the population
variance. The symbol $s^2$ represents the sample variance; the sample standard deviation s is the square root of the
sample variance. You can think of the standard deviation as a special average of the deviations.


If the numbers come from a census of the entire population and not a sample, when we calculate the average of 
the squared deviations to find the variance, we divide by N, the number of items in the population. If the data are 
from a sample rather than a population, when we calculate the average of the squared deviations, we divide by n-1,
one less than the number of items in the sample. You can see that in the formulas below.


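Formulas for the Sample Standard Deviation
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}} \quad \text{or} \quad s = \sqrt{\frac{\sum f \cdot (x - \bar{x})^2}{n-1}}$$
• For the sample standard deviation, the denominator is n-1, that is, the sample size MINUS 1.
Formulas for the Population Standard Deviation


$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} \quad \text{or} \quad \sigma = \sqrt{\frac{\sum f \cdot (x - \mu)^2}{N}}$$
• For the population standard deviation, the denominator is N, the number of items in the population.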


In these formulas, f represents the frequency with which a value appears. For example, if a value appears once, f 
is 1. If a value appears three times in the data set or population, f is 3. 
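
The frequency forms of the two formulas translate directly into code. The sketch below is one possible implementation, not the only one; the function names and the small data set are made up for illustration, and the results are compared with `statistics.stdev` and `statistics.pstdev` from the Python standard library.

```python
import statistics
from math import sqrt


def sample_sd(values, freqs):
    """Sample standard deviation from distinct values and their frequencies."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    squared = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs))
    return sqrt(squared / (n - 1))  # divide by n - 1 for a sample


def population_sd(values, freqs):
    """Population standard deviation from distinct values and their frequencies."""
    N = sum(freqs)
    mu = sum(f * x for x, f in zip(values, freqs)) / N
    squared = sum(f * (x - mu) ** 2 for x, f in zip(values, freqs))
    return sqrt(squared / N)  # divide by N for a population


# Made-up data: the value 1 appears three times, 2 twice, and 3 once.
values, freqs = [1, 2, 3], [3, 2, 1]
expanded = [1, 1, 1, 2, 2, 3]

print(sample_sd(values, freqs), statistics.stdev(expanded))
print(population_sd(values, freqs), statistics.pstdev(expanded))
```

For the same data, the sample version is slightly larger than the population version because it divides by n - 1 instead of N.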


Sampling Variability of a Statistic 

The statistic of a sampling distribution was discussed in Descriptive Statistics: Measuring the Center of the
Data. How much the statistic varies from one sample to another is known as the sampling variability of a
statistic. You typically measure the sampling variability of a statistic by its standard error. The standard error of
the mean is an example of a standard error. It is a special standard deviation and is known as the standard
deviation of the sampling distribution of the mean. You will cover the standard error of the mean in The Central
Limit Theorem (not now). The notation for the standard error of the mean is $\frac{\sigma}{\sqrt{n}}$, where $\sigma$ is the standard
deviation of the population and n is the size of the sample.
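
As a brief illustration of the notation, the standard error of the mean is the population standard deviation divided by the square root of the sample size. The function name and the numbers below are hypothetical and serve only to show the arithmetic.

```python
from math import sqrt


def standard_error_of_mean(sigma, n):
    """Standard error of the mean: population standard deviation over sqrt(n)."""
    return sigma / sqrt(n)


# Hypothetical values: population standard deviation 2, sample size 100.
print(standard_error_of_mean(2, 100))
```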


Note: In practice, USE A CALCULATOR OR COMPUTER SOFTWARE TO CALCULATE THE
STANDARD DEVIATION. If you are using a TI-83,83+,84+ calculator, you need to select the appropriate
standard deviation $\sigma_x$ or $s_x$ from the summary statistics. We will concentrate on using and interpreting the
information that the standard deviation gives us. However, you should study the following step-by-step example to
help you understand how the standard deviation measures variation from the mean.


Example: 

In a fifth grade class, the teacher was interested in the average age and the sample standard deviation of the ages
of her students. The following data are the ages for a SAMPLE of n = 20 fifth grade students. The ages are
rounded to the nearest half year:

9 9.5 9.5 10 10 10 10 10.5 10.5 10.5 10.5 11 11 11 11 11 11 11.5 11.5 11.5

Equation:


$$\bar{x} = \frac{9 + 9.5 \times 2 + 10 \times 4 + 10.5 \times 4 + 11 \times 6 + 11.5 \times 3}{20} = 10.525$$


The average age is 10.53 years, rounded to 2 places. 
The variance may be calculated by using a table. Then the standard deviation is calculated by taking the square 
root of the variance. We will explain the parts of the table after calculating s. 


Data    Freq.    Deviations                 Deviations²               (Freq.)(Deviations²)
$x$     $f$      $(x - \bar{x})$            $(x - \bar{x})^2$         $(f)(x - \bar{x})^2$
9       1        9 - 10.525 = -1.525        (-1.525)² = 2.325625      1 x 2.325625 = 2.325625
9.5     2        9.5 - 10.525 = -1.025      (-1.025)² = 1.050625      2 x 1.050625 = 2.101250
10      4        10 - 10.525 = -0.525       (-0.525)² = 0.275625      4 x 0.275625 = 1.1025
10.5    4        10.5 - 10.525 = -0.025     (-0.025)² = 0.000625      4 x 0.000625 = 0.0025
11      6        11 - 10.525 = 0.475        (0.475)² = 0.225625       6 x 0.225625 = 1.35375
11.5    3        11.5 - 10.525 = 0.975      (0.975)² = 0.950625       3 x 0.950625 = 2.851875


The sample variance, $s^2$, is equal to the sum of the last column (9.7375) divided by the total number of data
values minus one (20 - 1):

$$s^2 = \frac{9.7375}{20 - 1} = 0.5125$$

The sample standard deviation s is equal to the square root of the sample variance:

$$s = \sqrt{0.5125} = 0.715891$$
Rounded to two decimal places, s = 0.72.

Typically, you do the calculation for the standard deviation on your calculator or computer. The 
intermediate results are not rounded. This is done for accuracy. 

Exercise: 


Problem: Verify the mean and standard deviation calculated above on your calculator or computer. 


Solution: 
Using the TI-83,83+,84+ Calculators 


• Enter data into the list editor. Press STAT 1:EDIT. If necessary, clear the lists by arrowing up into the
name. Press CLEAR and arrow down.

• Put the data values (9, 9.5, 10, 10.5, 11, 11.5) into list L1 and the frequencies (1, 2, 4, 4, 6, 3) into list
L2. Use the arrow keys to move around.

• Press STAT and arrow to CALC. Press 1:1-VarStats and enter L1 (2nd 1), L2 (2nd 2). Do not forget the
comma. Press ENTER.

• $\bar{x} = 10.525$

• Use Sx because this is sample data (not a population): Sx = 0.715891
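
For readers verifying the result on a computer instead of a calculator, the following sketch uses Python's standard `statistics` module. It assumes the same values and frequencies that are entered into L1 and L2; `statistics.stdev` divides by n - 1, so it corresponds to Sx rather than to the population standard deviation.

```python
import statistics

values = [9, 9.5, 10, 10.5, 11, 11.5]  # the data values (list L1)
freqs = [1, 2, 4, 4, 6, 3]             # their frequencies (list L2)

# Expand the frequency table into the 20 individual ages.
ages = [x for x, f in zip(values, freqs) for _ in range(f)]

mean = statistics.mean(ages)
variance = statistics.variance(ages)  # sample variance (divides by n - 1)
s = statistics.stdev(ages)            # sample standard deviation, like Sx

print(len(ages), mean, round(variance, 4), round(s, 6))
```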


For the following problems, recall that value = mean + (#ofSTDEVs)(standard deviation)
For a sample: $x = \bar{x} + (\text{\#ofSTDEVs})(s)$

For a population: $x = \mu + (\text{\#ofSTDEVs})(\sigma)$

For this example, use $x = \bar{x} + (\text{\#ofSTDEVs})(s)$ because the data is from a sample.


Exercise: 
Problem: Find the value that is 1 standard deviation above the mean. Find $(\bar{x} + 1s)$.


Solution: 


$(\bar{x} + 1s) = 10.53 + (1)(0.72) = 11.25$


Exercise: 
Problem: Find the value that is two standard deviations below the mean. Find $(\bar{x} - 2s)$.


Solution: 


$(\bar{x} - 2s) = 10.53 - (2)(0.72) = 9.09$


Exercise: 


Problem: Find the values that are 1.5 standard deviations from (below and above) the mean. 


Solution: 


• $(\bar{x} - 1.5s) = 10.53 - (1.5)(0.72) = 9.45$
• $(\bar{x} + 1.5s) = 10.53 + (1.5)(0.72) = 11.61$
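
The three exercises above all apply the same formula with a different number of standard deviations. A short sketch, using the rounded mean and standard deviation from the exercises and a helper name chosen here for illustration:

```python
mean, s = 10.53, 0.72  # rounded sample mean and sample standard deviation


def value_at(k):
    """Value that is k standard deviations from the mean (negative k = below)."""
    return round(mean + k * s, 2)


for k in (1, -2, -1.5, 1.5):
    print(k, value_at(k))
```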


Explanation of the standard deviation calculation shown in the table 

The deviations show how spread out the data are about the mean. The data value 11.5 is farther from the mean than
is the data value 11. The deviations 0.975 and 0.475 indicate that. A positive deviation occurs when the data value is
greater than the mean. A negative deviation occurs when the data value is less than the mean; the deviation is
-1.525 for the data value 9. If you add the deviations, the sum is always zero. (For this example, there are n=20
deviations.) So you cannot simply add the deviations to get the spread of the data. By squaring the deviations, you
make them positive numbers, and the sum will also be positive. The variance, then, is the average squared
deviation.
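
The claim that the deviations always add to zero can be checked numerically for the ages in this example. This is a sketch with illustrative variable names; because of floating-point rounding, the computed sum may differ from zero by a negligible amount.

```python
values = [9, 9.5, 10, 10.5, 11, 11.5]
freqs = [1, 2, 4, 4, 6, 3]
ages = [x for x, f in zip(values, freqs) for _ in range(f)]  # the 20 ages

mean = sum(ages) / len(ages)
deviations = [x - mean for x in ages]

sum_of_deviations = sum(deviations)               # zero, up to rounding error
sum_of_squares = sum(d ** 2 for d in deviations)  # positive; matches the table total

print(sum_of_deviations, round(sum_of_squares, 4))
```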


The variance is a squared measure and does not have the same units as the data. Taking the square root solves the 
problem. The standard deviation measures the spread in the same units as the data. 


Notice that instead of dividing by n=20, the calculation divided by n-1=20-1=19 because the data is a sample. For 
the sample variance, we divide by the sample size minus one (n-1). Why not divide by n? The answer has to do 
with the population variance. The sample variance is an estimate of the population variance. Based on the 
theoretical mathematics that lies behind these calculations, dividing by (n-1) gives a better estimate of the 
population variance. 


Note: Your concentration should be on what the standard deviation tells us about the data. The standard deviation 
is a number which measures how far the data are spread from the mean. Let a calculator or computer do the 
arithmetic. 


The standard deviation, s or $\sigma$, is either zero or larger than zero. When the standard deviation is 0, there is no
spread; that is, all the data values are equal to each other. The standard deviation is small when the data are all
concentrated close to the mean, and is larger when the data values show more variation from the mean. When the
standard deviation is a lot larger than zero, the data values are very spread out about the mean; outliers can make s
or $\sigma$ very large.


The standard deviation, when first presented, can seem unclear. By graphing your data, you can get a better "feel" 
for the deviations and the standard deviation. You will find that in symmetrical distributions, the standard deviation 
can be very helpful but in skewed distributions, the standard deviation may not be much help. The reason is that 
the two sides of a skewed distribution have different spreads. In a skewed distribution, it is better to look at the 
first quartile, the median, the third quartile, the smallest value, and the largest value. Because numbers can be 
confusing, always graph your data. 


Note: The formula for the standard deviation is at the end of the chapter.


Example: 
Exercise: 


Problem: Use the following data (first exam scores) from Susan Dean's spring pre-calculus class: 


33 42 49 49 53 55 55 61 63 67 68 68 69 69 72 73 74 78 80 83 88 88 88 90 92 94 94 94 94 96 100


• a. Create a chart containing the data, frequencies, relative frequencies, and cumulative relative
frequencies to three decimal places.
• b. Calculate the following to one decimal place using a TI-83+ or TI-84 calculator:


i. The sample mean

ii. The sample standard deviation

iii. The median

iv. The first quartile

v. The third quartile

vi. IQR


• c. Construct a box plot and a histogram on the same set of axes. Make comments about the box plot, the
histogram, and the chart.


Solution: 
a.

Data   Frequency   Relative Frequency   Cumulative Relative Frequency
33     1           0.032                0.032
42     1           0.032                0.064
49     2           0.065                0.129
53     1           0.032                0.161
55     2           0.065                0.226
61     1           0.032                0.258
63     1           0.032                0.290
67     1           0.032                0.322
68     2           0.065                0.387
69     2           0.065                0.452
72     1           0.032                0.484
73     1           0.032                0.516
74     1           0.032                0.548
78     1           0.032                0.580
80     1           0.032                0.612
83     1           0.032                0.644
88     3           0.097                0.741
90     1           0.032                0.773
92     1           0.032                0.805
94     4           0.129                0.934
96     1           0.032                0.966
100    1           0.032                0.998 (Why isn't this value 1?)
b.

    i. The sample mean = 73.5
    ii. The sample standard deviation = 17.9
    iii. The median = 73
    iv. The first quartile = 61
    v. The third quartile = 90
    vi. IQR = 90 - 61 = 29

c. The x-axis goes from 32.5 to 100.5; the y-axis goes from -2.4 to 15 for the histogram. The number of intervals is 5 for the histogram, so the width of an interval is (100.5 - 32.5) divided by 5, which is equal to 13.6. Endpoints of the intervals: the starting point is 32.5, 32.5 + 13.6 = 46.1, 46.1 + 13.6 = 59.7, 59.7 + 13.6 = 73.3, 73.3 + 13.6 = 86.9, 86.9 + 13.6 = 100.5 = the ending value. No data values fall on an interval boundary.


The long left whisker in the box plot is reflected in the left side of the histogram. The spread of the exam scores in 
the lower 50% is greater (73 - 33 = 40) than the spread in the upper 50% (100 - 73 = 27). The histogram, box plot, 
and chart all reflect this. There are a substantial number of A and B grades (80s, 90s, and 100). The histogram 
clearly shows this. The box plot shows us that the middle 50% of the exam scores (IQR = 29) are Ds, Cs, and Bs. 
The box plot also shows us that the lower 25% of the exam scores are Ds and Fs. 
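The answers to parts b and c can be checked without a calculator. The following is a minimal sketch using Python's standard `statistics` module; the helper `quartiles` is a hypothetical name, and its rule (medians of the lower and upper halves, leaving out the overall median when the count is odd) is an assumption chosen because it reproduces the quartiles quoted above. Other software may use a different quartile convention.

```python
from statistics import mean, median, stdev

# First exam scores from the example above (31 values, already sorted).
scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]


def quartiles(sorted_values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: with an odd number of values, the overall median is
    left out of both halves.
    """
    n = len(sorted_values)
    half = n // 2
    lower = sorted_values[:half]
    upper = sorted_values[half + 1:] if n % 2 else sorted_values[half:]
    return median(lower), median(upper)


x_bar = mean(scores)
s = stdev(scores)              # sample standard deviation (divides by n - 1)
med = median(scores)
q1, q3 = quartiles(scores)
iqr = q3 - q1

# Histogram from part c: 5 equal-width intervals between 32.5 and 100.5.
low, high, bins = 32.5, 100.5, 5
width = (high - low) / bins
counts = [0] * bins
for x in scores:
    # min(...) guards a value equal to the top edge; none occur here.
    counts[min(int((x - low) // width), bins - 1)] += 1

print(round(x_bar, 1), round(s, 1), med, q1, q3, iqr)
print(round(width, 1), counts)
```

The interval counts make the left skew visible: the highest interval holds the most scores, while the two lowest intervals hold the fewest.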


Comparing Values from Different Data Sets 
The standard deviation is useful when comparing data values that come from different data sets. If the data sets 
have different means and standard deviations, it can be misleading to compare the data values directly. 


• For each data value, calculate how many standard deviations the value is away from its mean.
• Use the formula: value = mean + (#ofSTDEVs)(standard deviation); solve for #ofSTDEVs.

  $\text{\#ofSTDEVs} = \dfrac{\text{value} - \text{mean}}{\text{standard deviation}}$

• Compare the results of this calculation.


#ofSTDEVs is often called a "z-score"; we can use the symbol $z$. In symbols, the formulas become:

Sample:      $x = \bar{x} + zs$        $z = \dfrac{x - \bar{x}}{s}$
Population:  $x = \mu + z\sigma$       $z = \dfrac{x - \mu}{\sigma}$
Example: 
Exercise: 
Problem: 


Two students, John and Ali, from different high schools, wanted to find out who had the highest G.P.A. when 
compared to his school. Which student had the highest G.P.A. when compared to his school? 


Student GPA School Mean GPA School Standard Deviation 
John 2.85 3.0 0.7 
Ali 77 80 10 

Solution: 


For each student, determine how many standard deviations (#ofSTDEVs) his GPA is away from the average for his school. Pay careful attention to signs when comparing and interpreting the answer.

$\text{\#ofSTDEVs} = \dfrac{\text{value} - \text{mean}}{\text{standard deviation}}$; that is, $z = \dfrac{x - \mu}{\sigma}$

For John, $z = \text{\#ofSTDEVs} = \dfrac{2.85 - 3.0}{0.7} = -0.21$
For Ali, $z = \text{\#ofSTDEVs} = \dfrac{77 - 80}{10} = -0.3$


John has the better G.P.A. when compared to his school because his G.P.A. is 0.21 standard deviations below 
his school's mean while Ali's G.P.A. is 0.3 standard deviations below his school's mean. 


John's z-score of -0.21 is higher than Ali's z-score of -0.3. For GPA, higher values are better, so we conclude that John has the better GPA when compared to his school.
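The comparison above can be sketched in a few lines of Python. The function name `z_score` is an illustrative choice, not something defined in the text; the numbers are the ones from the table in this example.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies from mean."""
    return (value - mean) / std_dev


z_john = z_score(2.85, 3.0, 0.7)
z_ali = z_score(77, 80, 10)

# Both z-scores are negative (both students are below their school's mean);
# the larger z-score is the one closer to the mean.
better = "John" if z_john > z_ali else "Ali"
print(f"John: {z_john:.2f}, Ali: {z_ali:.2f}, better relative GPA: {better}")
```

Because the two schools use different GPA scales, only the z-scores, not the raw GPAs, are comparable.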


The following lists give a few facts that provide a little more insight into what the standard deviation tells us about 
the distribution of the data. 
For ANY data set, no matter what the distribution of the data is: 


• At least 75% of the data is within 2 standard deviations of the mean.
• At least 89% of the data is within 3 standard deviations of the mean.
• At least 95% of the data is within 4.5 standard deviations of the mean.
• This is known as Chebyshev's Rule.
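The three percentages above agree with the general form of Chebyshev's bound: for any k greater than 1, at least 1 - 1/k² of the data lies within k standard deviations of the mean. That formula is not stated in this chapter; it is included here as a standard result, and the function name below is only illustrative.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of ANY data set within k standard deviations."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2


for k in (2, 3, 4.5):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1%}")
```

The bound is a guaranteed minimum, so real data sets usually have a larger share of values inside each range.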


For data having a distribution that is MOUND-SHAPED and SYMMETRIC: 


• Approximately 68% of the data is within 1 standard deviation of the mean.
• Approximately 95% of the data is within 2 standard deviations of the mean.
• More than 99% of the data is within 3 standard deviations of the mean.
• This is known as the Empirical Rule.
• It is important to note that this rule only applies when the shape of the distribution of the data is mound-shaped and symmetric. We will learn more about this when studying the "Normal" or "Gaussian" probability distribution in later chapters.
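As a rough check on the Empirical Rule, the same fractions can be computed for an exactly normal distribution. This sketch assumes the standard normal model from Python's `statistics.NormalDist`; real mound-shaped data will only approximate these values.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {fraction_within(k):.1%}")
```

Compare these with Chebyshev's Rule: the extra assumption about shape buys much tighter statements (about 95% instead of at least 75% within 2 standard deviations).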


**With contributions from Roberta Bloom 


Glossary 


Standard Deviation 
A number that is equal to the square root of the variance and measures how far data values are from their mean. Notation: $s$ for sample standard deviation and $\sigma$ for population standard deviation.


Variance 
Mean of the squared deviations from the mean. Square of the standard deviation. For a set of data, a deviation can be represented as $x - \bar{x}$ where $x$ is a value of the data and $\bar{x}$ is the sample mean. The sample variance is equal to the sum of the squares of the deviations divided by the difference of the sample size and 1.


Descriptive Statistics: Summary of Formulas 
A summary of useful formulas used in examining descriptive statistics 
Commonly Used Symbols 


• The symbol $\sum$ means to add or to find the sum.
• $n$ = the number of data values in a sample
• $N$ = the number of people, things, etc. in the population
• $\bar{x}$ = the sample mean
• $s$ = the sample standard deviation
• $\mu$ = the population mean
• $\sigma$ = the population standard deviation
• $f$ = frequency
• $x$ = numerical value


Commonly Used Expressions

• $x \cdot f$ = A value multiplied by its respective frequency
• $\sum x$ = The sum of the values
• $\sum x \cdot f$ = The sum of values multiplied by their respective frequencies
• $(x - \bar{x})$ or $(x - \mu)$ = Deviations from the mean (how far a value is from the mean)
• $(x - \bar{x})^2$ or $(x - \mu)^2$ = Deviations squared
• $f(x - \bar{x})^2$ or $f(x - \mu)^2$ = The deviations squared and multiplied by their frequencies


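Mean Formulas:

• $\bar{x} = \dfrac{\sum x}{n}$ or $\bar{x} = \dfrac{\sum f \cdot x}{n}$
• $\mu = \dfrac{\sum x}{N}$ or $\mu = \dfrac{\sum f \cdot x}{N}$

Standard Deviation Formulas:

• $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$ or $s = \sqrt{\dfrac{\sum f \cdot (x - \bar{x})^2}{n - 1}}$
• $\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$ or $\sigma = \sqrt{\dfrac{\sum f \cdot (x - \mu)^2}{N}}$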


Formulas Relating a Value, the Mean, and the Standard Deviation: 


• value = mean + (#ofSTDEVs)(standard deviation), where #ofSTDEVs = the number of standard deviations
• $x = \bar{x} + (\text{\#ofSTDEVs})(s)$
• $x = \mu + (\text{\#ofSTDEVs})(\sigma)$


Descriptive Statistics: Practice 1 

This module provides students with opportunities to apply concepts related 
to descriptive statistics. Students are asked to take a set of sample data and 
calculate a series of statistical values for that data. 


Student Learning Outcomes 


• The student will calculate and interpret the center, spread, and location of the data.
• The student will construct and interpret histograms and box plots.


Given 


Sixty-five randomly selected car salespersons were asked the number of 
cars they generally sell in one week. Fourteen people answered that they 
generally sell three cars; nineteen generally sell four cars; twelve generally 
sell five cars; nine generally sell six cars; eleven generally sell seven cars. 


Complete the Table 


Data Value (# cars)   Frequency   Relative Frequency   Cumulative Relative Frequency


Discussion Questions 


Exercise: 


Problem: What does the frequency column sum to? Why? 


Solution: 
65 


Exercise: 


Problem: What does the relative frequency column sum to? Why? 


Solution: 


1 
Exercise: 


Problem: 


What is the difference between relative frequency and frequency for 
each data value? 


Exercise: 


Problem: 


What is the difference between cumulative relative frequency and 
relative frequency for each data value? 
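One way to check a completed table is to build it from the counts in the "Given" paragraph. This is a minimal sketch; the variable names are illustrative. Cumulative relative frequencies are computed from running counts rather than by adding rounded relative frequencies, so the last entry is exactly 1.

```python
# (number of cars, number of salespersons) from the "Given" paragraph
car_sales = [(3, 14), (4, 19), (5, 12), (6, 9), (7, 11)]

n = sum(freq for _, freq in car_sales)

rows = []
running_count = 0
for value, freq in car_sales:
    running_count += freq
    rows.append((value, freq, freq / n, running_count / n))

print(f"{'# cars':>6} {'Freq.':>6} {'Rel. Freq.':>11} {'Cum. Rel. Freq.':>16}")
for value, freq, rel, cum in rows:
    print(f"{value:>6} {freq:>6} {rel:>11.3f} {cum:>16.3f}")
```

Adding already-rounded relative frequencies instead can leave the final cumulative value slightly off from 1.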


Enter the Data 


Enter your data into your calculator or computer. 


Construct a Histogram 


Determine appropriate minimum and maximum x and y values and the 
scaling. Sketch the histogram below. Label the horizontal and vertical axes 
with words. Include numerical scaling. 


Data Statistics 


Calculate the following values: 
Exercise: 


Problem: Sample mean = $\bar{x}$ =


Solution: 


4.75 


Exercise: 


Problem: Sample standard deviation = $s_x$ =


Solution: 


1.39 


Exercise: 


Problem: Sample size = n = 


Solution: 


65 


Calculations 


Use the table in section 2.11.3 to calculate the following values: 
Exercise: 


Problem: Median = 


Solution: 
4 
Exercise: 


Problem: Mode = 


Solution: 
4 
Exercise: 


Problem: First quartile = 


Solution: 
4 
Exercise: 


Problem: Second quartile = median = 50th percentile = 


Solution: 


4 


Exercise: 


Problem: Third quartile = 


Solution: 
6 


Exercise: 


Problem: Interquartile range (IQR) = _____ - _____ = _____

Solution:
6 - 4 = 2

Exercise: 


Problem: 10th percentile = 


Solution: 
3 
Exercise: 


Problem: 70th percentile = 

Solution: 

6 

Exercise: 

Problem: Find the value that is 3 standard deviations: 
a. Above the mean
b. Below the mean

Solution:

a. 8.93
b. 0.58
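The answers in the Data Statistics and Calculations sections can be reproduced from the frequency counts with the frequency forms of the mean and standard deviation formulas. In this sketch the `percentile` helper is hypothetical, and its rule (the smallest value whose cumulative relative frequency reaches k percent) is an assumption; it is one of several percentile conventions, chosen because it agrees with the solutions above.

```python
from math import sqrt

# (number of cars, number of salespersons) from the "Given" paragraph
car_sales = [(3, 14), (4, 19), (5, 12), (6, 9), (7, 11)]

n = sum(f for _, f in car_sales)
x_bar = sum(x * f for x, f in car_sales) / n
# Sample standard deviation: squared deviations weighted by frequency,
# divided by n - 1.
s = sqrt(sum(f * (x - x_bar) ** 2 for x, f in car_sales) / (n - 1))
mode = max(car_sales, key=lambda pair: pair[1])[0]


def percentile(k):
    """Smallest value whose cumulative relative frequency is >= k percent."""
    cumulative = 0
    for x, f in car_sales:
        cumulative += f
        if cumulative * 100 >= k * n:
            return x
    raise ValueError("k must be between 0 and 100")


print(round(x_bar, 2), round(s, 2), mode)
print([percentile(k) for k in (10, 25, 50, 70, 75)])
print(round(x_bar + 3 * s, 2), round(x_bar - 3 * s, 2))
```

Note that the values 3 standard deviations from the mean are computed from the unrounded mean and standard deviation, then rounded once at the end.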


Box Plot 


Construct a box plot below. Use a ruler to measure and scale accurately. 


Interpretation 


Looking at your box plot, does it appear that the data are concentrated 
together, spread out evenly, or concentrated in some areas, but not in 
others? How can you tell? 


Descriptive Statistics: Homework 

Descriptive Statistics: Homework is part of the collection col10555 written 
by Barbara Illowsky and Susan Dean and provides homework questions 
related to lessons about descriptive statistics. 

Exercise: 


Problem: 


Twenty-five randomly selected students were asked the number of 
movies they watched the previous week. The results are as follows: 


# of movies   Frequency   Relative Frequency   Cumulative Relative Frequency
0             5
1             9
2             6
3             4
4             1


a. Find the sample mean $\bar{x}$.
b. Find the sample standard deviation, $s$.
c. Construct a histogram of the data.
d. Complete the columns of the chart.
e. Find the first quartile.
f. Find the median.
g. Find the third quartile.
h. Construct a box plot of the data.
i. What percent of the students saw fewer than three movies?
j. Find the 40th percentile.
k. Find the 90th percentile.
l. Construct a line graph of the data.
m. Construct a stem plot of the data.


Solution: 


a. 1.48
b. 1.12
e. 1
f. 1
i. 80%
j. 1
k. 3


Exercise: 


Problem: 


The median age for U.S. blacks currently is 30.9 years; for U.S. whites it is 42.3 years. (Source: http://www.usatoday.com/news/nation/story/2012-05-17/minority-births-census/55029100/1)

a. Based upon this information, give two reasons why the black median age could be lower than the white median age.
b. Does the lower median age for blacks necessarily mean that blacks die younger than whites? Why or why not?
c. How might it be possible for blacks and whites to die at approximately the same age, but for the median age for whites to be higher?


Exercise: 
Problem: 
Forty randomly selected students were asked the number of pairs of 


sneakers they owned. Let X = the number of pairs of sneakers owned. 
The results are as follows: 


X   Frequency   Relative Frequency   Cumulative Relative Frequency
1   2
2   5
3   8
4   12
5   12
7   1


a. Find the sample mean $\bar{x}$.
b. Find the sample standard deviation, $s$.
c. Construct a histogram of the data.
d. Complete the columns of the chart.
e. Find the first quartile.
f. Find the median.
g. Find the third quartile.
h. Construct a box plot of the data.
i. What percent of the students owned at least five pairs?
j. Find the 40th percentile.
k. Find the 90th percentile.
l. Construct a line graph of the data.
m. Construct a stem plot of the data.


Solution: 


a. 3.78
b. 1.29
e. 3
f. 4
g. 5
i. 32.5%
j. 4
k. 5
Exercise: 
Problem: 
600 adult Americans were asked by telephone poll, "What do you think constitutes a middle-class income?" The results are below. Also, include the left endpoint, but not the right endpoint. (Source: Time magazine; survey by Yankelovich Partners, Inc.)

Note: "Not sure" answers were omitted from the results.


Salary ($) Relative Frequency 
< 20,000 0.02 
20,000 - 25,000 0.09 
25,000 - 30,000 0.19 
30,000 - 40,000 0.26 
40,000 - 50,000 0.18 
50,000 - 75,000 0.17 
75,000 - 99,999 0.02 
100,000+ 0.01 


a. What percent of the survey answered "not sure"?
b. What percent think that middle-class is from $25,000 - $50,000?
c. Construct a histogram of the data.

    i. Should all bars have the same width, based on the data? Why or why not?
    ii. How should the <20,000 and the 100,000+ intervals be handled? Why?

d. Find the 40th and 80th percentiles.
e. Construct a bar graph of the data.


Exercise: 


Problem: 


Following are the published weights (in pounds) of all of the team 
members of the San Francisco 49ers from a previous year (Source: San 
Jose Mercury News) 


177 205 210 210 232 205 185 185 178 210 206 212 184 174 185 242 
188 212 215 247 241 223 220 260 245 259 278 270 280 295 275 285 
290 272 273 280 285 286 200 215 185 230 250 241 190 260 250 302 
265 290 276 228 265 


a. Organize the data from smallest to largest value.
b. Find the median.
c. Find the first quartile.
d. Find the third quartile.
e. Construct a box plot of the data.
f. The middle 50% of the weights are from _____ to _____.
g. If our population were all professional football players, would the above data be a sample of weights or the population of weights? Why?
h. If our population were the San Francisco 49ers, would the above data be a sample of weights or the population of weights? Why?
i. Assume the population was the San Francisco 49ers. Find:

    i. the population mean, $\mu$.
    ii. the population standard deviation, $\sigma$.
    iii. the weight that is 2 standard deviations below the mean.
    iv. When Steve Young, quarterback, played football, he weighed 205 pounds. How many standard deviations above or below the mean was he?

j. That same year, the mean weight for the Dallas Cowboys was 240.08 pounds with a standard deviation of 44.38 pounds. Emmitt Smith weighed in at 209 pounds. With respect to his team, who was lighter, Smith or Young? How did you determine your answer?


Solution: 


b. 241
c. 205.5
d. 272.5
e. Box plot values: 174, 205.5, 241, 272.5, 302
f. 205.5, 272.5
g. sample
h. population
i.

    i. 236.34
    ii. 37.50
    iii. 161.34
    iv. 0.84 std. dev. below the mean

j. Young
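Parts i and j can be checked as follows. This is a sketch using Python's `statistics` module; `pstdev` is used because the team is being treated as the whole population, so the sum of squared deviations is divided by N rather than n - 1. The Dallas figures are the summary values given in part j.

```python
from statistics import mean, pstdev

# Published 49ers weights (pounds) from the problem statement, 53 players.
weights = [177, 205, 210, 210, 232, 205, 185, 185, 178, 210, 206, 212, 184,
           174, 185, 242, 188, 212, 215, 247, 241, 223, 220, 260, 245, 259,
           278, 270, 280, 295, 275, 285, 290, 272, 273, 280, 285, 286, 200,
           215, 185, 230, 250, 241, 190, 260, 250, 302, 265, 290, 276, 228,
           265]

mu = mean(weights)
sigma = pstdev(weights)            # population standard deviation
two_below = mu - 2 * sigma

# z-scores relative to each player's own team
z_young = (205 - mu) / sigma
z_smith = (209 - 240.08) / 44.38

lighter = "Young" if z_young < z_smith else "Smith"
print(round(mu, 2), round(sigma, 2), round(two_below, 2))
print(round(z_young, 2), round(z_smith, 2), lighter)
```

Young's z-score is further below zero than Smith's, which is why he is the lighter player relative to his team even though the raw weights are close.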


Exercise: 


Problem: 


An elementary school class ran 1 mile with a mean of 11 minutes and a 
standard deviation of 3 minutes. Rachel, a student in the class, ran 1 
mile in 8 minutes. A junior high school class ran 1 mile with a mean of 
9 minutes and a standard deviation of 2 minutes. Kenji, a student in the 
class, ran 1 mile in 8.5 minutes. A high school class ran 1 mile with a 
mean of 7 minutes and a standard deviation of 4 minutes. Nedda, a 
student in the class, ran 1 mile in 8 minutes. 


a. Why is Kenji considered a better runner than Nedda, even though Nedda ran faster than he?
b. Who is the fastest runner with respect to his or her class? Explain why.


Exercise: 


Problem: 


In a survey of 20 year olds in China, Germany and America, people 
were asked the number of foreign countries they had visited in their 
lifetime. The following box plots display the results. 


[Figure: box plots for China, Germany, and America; horizontal axis from 0 to 11]


a. In complete sentences, describe what the shape of each box plot implies about the distribution of the data collected.
b. Explain how it is possible that more Americans than Germans surveyed have been to over eight foreign countries.
c. Compare the three box plots. What do they imply about the foreign travel of twenty year old residents of the three countries when compared to each other?


Exercise: 


Problem: 


One hundred teachers attended a seminar on mathematical problem 
solving. The attitudes of a representative sample of 12 of the teachers 
were measured before and after the seminar. A positive number for 
change in attitude indicates that a teacher's attitude toward math 
became more positive. The twelve change scores are as follows: 


3 8 -1 2 0 5 -3 1 -1 6 5 -2


a. What is the mean change score?
b. What is the standard deviation for this population?
c. What is the median change score?
d. Find the change score that is 2.2 standard deviations below the mean.


Exercise: 


Problem: 


Three students were applying to the same graduate school. They came 
from schools with different grading systems. Which student had the 
best G.P.A. when compared to his school? Explain how you 
determined your answer. 


Student   G.P.A.   School Ave. G.P.A.   School Standard Deviation
Thuy      2.7      3.2                  0.8
Vichet    87       75                   20
Kamala    8.6      8                    0.4
Solution: 
Kamala 
Exercise: 


Problem: Given the following box plot: 


0 2 10 12 13 


a. Which quarter has the smallest spread of data? What is that spread?
b. Which quarter has the largest spread of data? What is that spread?
c. Find the Inter Quartile Range (IQR).
d. Are there more data in the interval 5 - 10 or in the interval 10 - 13? How do you know this?
e. Which interval has the fewest data in it? How do you know this?

    i. 0-2
    ii. 2-4
    iii. 10-12
    iv. 12-13


Exercise: 


Problem: Given the following box plot: 


0 20 100 150 


a. Think of an example (in words) where the data might fit into the above box plot. In 2-5 sentences, write down the example.
b. What does it mean to have the first and second quartiles so close together, while the second to fourth quartiles are far apart?


Exercise: 
Problem: 


Santa Clara County, CA, has approximately 27,873 Japanese- 
Americans. Their ages are as follows. (Source: West magazine) 


Age Group   Percent of Community
0-17        18.9
18-24       8.0
25-34       22.8
35-44       15.0
45-54       13.1
55-64       11.9
65+         10.3


a. Construct a histogram of the Japanese-American community in Santa Clara County, CA. The bars will not be the same width for this example. Why not?
b. What percent of the community is under age 35?
c. Which box plot most resembles the information above?

    i. 0 24 34 53 100
    ii. 0 18 34 45 100
    iii. 0 24 25 54 100
Exercise: 
Problem: 


Suppose that three book publishers were interested in the number of 
fiction paperbacks adult consumers purchase per month. Each 
publisher conducted a survey. In the survey, each asked adult 
consumers the number of fiction paperbacks they had purchased the 
previous month. The results are below. 


# of books   Freq.   Rel. Freq.
0            10
1            12
2            16
3            12
4            8
5            6
6            2
8            2
Publisher A

# of books   Freq.   Rel. Freq.
0            18
1            24
2            24
3            22
4            15
5            10
7            5
9            1
Publisher B

# of books   Freq.   Rel. Freq.
0-1          20
2-3          35
4-5          12
6-7          2
8-9          1
Publisher C


a. Find the relative frequencies for each survey. Write them in the charts.
b. Using either a graphing calculator, computer, or by hand, use the frequency column to construct a histogram for each publisher's survey. For Publishers A and B, make bar widths of 1. For Publisher C, make bar widths of 2.
c. In complete sentences, give two reasons why the graphs for Publishers A and B are not identical.
d. Would you have expected the graph for Publisher C to look like the other two graphs? Why or why not?
e. Make new histograms for Publisher A and Publisher B. This time, make bar widths of 2.
f. Now, compare the graph for Publisher C to the new graphs for Publishers A and B. Are the graphs more similar or more different? Explain your answer.


Exercise: 


Problem: 


Often, cruise ships conduct all on-board transactions, with the 
exception of gambling, on a cashless basis. At the end of the cruise, 
guests pay one bill that covers all on-board transactions. Suppose that 
60 single travelers and 70 couples were surveyed as to their on-board 
bills for a seven-day cruise from Los Angeles to the Mexican Riviera. 
Below is a summary of the bills for each group. 


Amount($)   Frequency   Rel. Frequency
51-100      5
101-150     10
151-200     15
201-250     15
251-300     10
301-350     5
Singles

Amount($)   Frequency   Rel. Frequency
100-150     5
201-250     5
251-300     5
301-350     5
351-400     10
401-450     10
451-500     10
501-550     10
551-600     5
601-650     5
Couples


a. Fill in the relative frequency for each group.
b. Construct a histogram for the Singles group. Scale the x-axis by $50 widths. Use relative frequency on the y-axis.
c. Construct a histogram for the Couples group. Scale the x-axis by $50. Use relative frequency on the y-axis.
d. Compare the two graphs:

    i. List two similarities between the graphs.
    ii. List two differences between the graphs.
    iii. Overall, are the graphs more similar or different?

e. Construct a new graph for the Couples by hand. Since each couple is paying for two individuals, instead of scaling the x-axis by $50, scale it by $100. Use relative frequency on the y-axis.
f. Compare the graph for the Singles with the new graph for the Couples:

    i. List two similarities between the graphs.
    ii. Overall, are the graphs more similar or different?

i. By scaling the Couples graph differently, how did it change the way you compared it to the Singles?
j. Based on the graphs, do you think that individuals spend the same amount, more or less, as singles as they do person by person in a couple? Explain why in one or two complete sentences.


Exercise: 
Problem: 
Refer to the following histograms and box plot. Determine which of 


the following are true and which are false. Explain your solution to 
each part in complete sentences. 


1 3 6 


a. The medians for all three graphs are the same.
b. We cannot determine if any of the means for the three graphs is different.
c. The standard deviation for (b) is larger than the standard deviation for (a).
d. We cannot determine if any of the third quartiles for the three graphs is different.

Solution:

a. True
b. True
c. True
d. False


Exercise: 


Problem: Refer to the following box plots. 


Data 1 

0 ys 4 i | 
Data 2 

0 2 7 


a. In complete sentences, explain why each statement is false.

    i. Data 1 has more data values above 2 than Data 2 has above 2.
    ii. The data sets cannot have the same mode.
    iii. For Data 1, there are more data values below 4 than there are above 4.

b. For which group, Data 1 or Data 2, is the value of "7" more likely to be an outlier? Explain why in complete sentences.

Exercise: 


Problem: 


In a recent issue of the IEEE Spectrum, 84 engineering conferences 
were announced. Four conferences lasted two days. Thirty-six lasted 
three days. Eighteen lasted four days. Nineteen lasted five days. Four 
lasted six days. One lasted seven days. One lasted eight days. One 
lasted nine days. Let X = the length (in days) of an engineering 
conference. 


a. Organize the data in a chart.
b. Find the median, the first quartile, and the third quartile.
c. Find the 65th percentile.
d. Find the 10th percentile.
e. Construct a box plot of the data.
f. The middle 50% of the conferences last from _____ days to _____ days.
g. Calculate the sample mean of days of engineering conferences.
h. Calculate the sample standard deviation of days of engineering conferences.
i. Find the mode.
j. If you were planning an engineering conference, which would you choose as the length of the conference: mean; median; or mode? Explain why you made that choice.
k. Give two reasons why you think that 3 - 5 days seem to be popular lengths of engineering conferences.


Solution: 


b. 4, 3, 5
c. 4
d. 3
e. [Box plot]
f. 3, 5
g. 3.94
h. 1.28
i. 3
j. mode

Exercise: 


Problem: 


A survey of enrollment at 35 community colleges across the United 
States yielded the following figures (source: Microsoft Bookshelf): 


6414 1550 2109 9350 21828 4300 5944 5722 2825 2044 5481 5200 
9853 2750 10012 6357 27000 9414 7681 3200 17500 9200 7380 
18314 6557 13713 17768 7493 2771 2861 1263 7285 28165 5080 
11622 


a. Organize the data into a chart with five intervals of equal width. Label the two columns "Enrollment" and "Frequency."
b. Construct a histogram of the data.
c. If you were to build a new community college, which piece of information would be more valuable: the mode or the mean?
d. Calculate the sample mean.
e. Calculate the sample standard deviation.
f. A school with an enrollment of 8000 would be how many standard deviations away from the mean?


Exercise: 


Problem: 


The median age of the U.S. population in 1980 was 30.0 years. In 
1991, the median age was 33.1 years. (Source: Bureau of the Census) 


a. What does it mean for the median age to rise?
b. Give two reasons why the median age could rise.
c. For the median age to rise, is the actual number of children less in 1991 than it was in 1980? Why or why not?

Solution:

c. Maybe


Exercise: 


Problem: 


A survey was conducted of 130 purchasers of new BMW 3 series cars, 
130 purchasers of new BMW 5 series cars, and 130 purchasers of new 
BMW 7 series cars. In it, people were asked the age they were when 
they purchased their car. The following box plots display the results. 


[Figure: box plots for the BMW 3 series, BMW 5 series, and BMW 7 series; horizontal axis from 25 to 80 in steps of 5]


a. In complete sentences, describe what the shape of each box plot implies about the distribution of the data collected for that car series.
b. Which group is most likely to have an outlier? Explain how you determined that.
c. Compare the three box plots. What do they imply about the age of purchasing a BMW from the series when compared to each other?
d. Look at the BMW 5 series. Which quarter has the smallest spread of data? What is that spread?
e. Look at the BMW 5 series. Which quarter has the largest spread of data? What is that spread?
f. Look at the BMW 5 series. Estimate the Inter Quartile Range (IQR).
g. Look at the BMW 5 series. Are there more data in the interval 31-38 or in the interval 45-55? How do you know this?
h. Look at the BMW 5 series. Which interval has the fewest data in it? How do you know this?

    i. 31-35
    ii. 38-41
    iii. 41-64


Exercise: 
Problem: 


The following box plot shows the U.S. population for 1990, the latest 
available year. (Source: Bureau of the Census, 1990 Census) 


0 17 33 50 #105 


a. Are there fewer or more children (age 17 and under) than senior citizens (age 65 and over)? How do you know?
b. 12.6% are age 65 and over. Approximately what percent of the population are of working age adults (above age 17 to age 65)?

Solution:

a. more children
b. 62.4%


Exercise: 


Problem: 


Javier and Ercilia are supervisors at a shopping mall. Each was given 
the task of estimating the mean distance that shoppers live from the 
mall. They each randomly surveyed 100 shoppers. The samples 
yielded the following information: 


             Javier      Ercilia
$\bar{x}$    6.0 miles   6.0 miles
$s$          4.0 miles   7.0 miles

a. How can you determine which survey was correct?
b. Explain what the difference in the results of the surveys implies about the data.
c. If the two histograms depict the distribution of values for each supervisor, which one depicts Ercilia's sample? How do you know?
d. If the two box plots depict the distribution of values for each supervisor, which one depicts Ercilia's sample? How do you know?

    i. 0 1 6 14 21
    ii. 0 4 6 9 12
Exercise: 


Problem: Student grades on a chemistry exam were: 
77, 78, 76, 81, 86, 51, 79, 82, 84, 99 
a. Construct a stem-and-leaf plot of the data.
b. Are there any potential outliers? If so, which scores are they? Why do you consider them outliers?

Solution:

b. 51, 99


Try these multiple choice questions (Exercises 24 - 30). 


The next three questions refer to the following information. We are 
interested in the number of years students in a particular elementary 
statistics class have lived in California. The information in the following 
table is from the entire section. 


Number of years   Frequency
7                 1
14                3
15                1
18                1
19                4
20                3
22                1
23                1
26                1
40                2
42                2
                  Total = 20
Exercise: 


Problem: What is the IQR? 


A. 8
B. 11
C. 15
D. 35


Solution: 


A 


Exercise: 


Problem: What is the mode? 


A. 19
B. 19.5
C. 14 and 20
D. 22.65


Solution: 
A 
Exercise: 
Problem: Is this a sample or the entire population? 


A. sample
B. entire population
C. neither


Solution: 
B 


The next two questions refer to the following table. X = the number of 
days per week that 100 clients use a particular exercise facility. 


x Frequency 


0 3 
1 12 
2 33 
3 28 
4 11 
5 9 
6 4 
Exercise: 


Problem: The 80th percentile is: 


A. 5
B. 80
C. 3
D. 4


Solution: 


D 
Exercise: 


Problem: 


The number that is 1.5 standard deviations BELOW the mean is 
approximately: 


A. 0.7
B. 4.8
C. -2.8
D. Cannot be determined


Solution: 


A 


The next two questions refer to the following histogram. Suppose one 
hundred eleven people who shopped in a special T-shirt store were asked 
the number of T-shirts they own costing more than $19 each. 


[Histogram. Vertical axis: Relative Frequency, with labeled values 40/111, 39/111, 30/111, 25/111, 23/111, 20/111, 17/111, and 10/111. Horizontal axis: Number of T-shirts costing more than $19 each.]


Exercise: 


Problem: 


The percent of people that own at most three (3) T-shirts costing more 
than $19 each is approximately: 


A. 21
B. 59
C. 41
D. Cannot be determined


Solution: 


C
Exercise: 
Problem: 


If the data were collected by asking the first 111 people who entered 
the store, then the type of sampling is: 


A. cluster
B. simple random
C. stratified
D. convenience


Solution: 


D 
Exercise: 


Problem: 


Below are the 2010 obesity rates by U.S. states and Washington, DC. (Source: http://www.cdc.gov/obesity/data/adult.html)


State            Percent (%)   State            Percent (%)
Alabama          32.2          Montana          23.0
Alaska           24.5          Nebraska         26.9
Arizona          24.3          Nevada           22.4
Arkansas         30.1          New Hampshire    25.0
California       24.0          New Jersey       23.8
Colorado         21.0          New Mexico       25.1
Connecticut      22.5          New York         23.9
Delaware         28.0          North Carolina   27.8
Washington, DC   22.2          North Dakota     27.2
Florida          26.6          Ohio             20.2
Georgia          29.6          Oklahoma         30.4
Hawaii           22.7          Oregon           26.8
Idaho            26.5          Pennsylvania     28.6
Illinois         28.2          Rhode Island     25.0
Indiana          29.6          South Carolina   31.5

Iowa             28.4          South Dakota     27.3
Kansas           29.4          Tennessee        30.8
Kentucky         31.3          Texas            31.0
Louisiana        31.0          Utah             22.5
Maine            26.8          Vermont          23.2
Maryland         27            Virginia         26.0
Massachusetts    23.0          Washington       25.5
Michigan         30.9          West Virginia    32.5
Minnesota        24.8          Wisconsin        26.3
Mississippi      34.0          Wyoming          25.1
Missouri         30.5


a. Construct a bar graph of obesity rates of your state and the four states closest to your state. Hint: Label the x-axis with the states.
b. Use a random number generator to randomly pick 8 states. Construct a bar graph of the obesity rates of those 8 states.
c. Construct a bar graph for all the states beginning with the letter "A".
d. Construct a bar graph for all the states beginning with the letter "M".

Solution: 


Example solution for b using the random number generator for the TI-84 Plus to generate a simple random sample of 8 states. Instructions are below.

• Number the entries in the table 1 - 51 (includes Washington, DC; numbered vertically)
• Press MATH
• Arrow over to PRB
• Press 5:randInt(
• Enter 51,1,8)


Eight numbers are generated (use the right arrow key to scroll through 
the numbers). The numbers correspond to the numbered states (for this 
example: {47 21 9 23 51 13 25 4}. If any numbers are repeated, 
generate a different number by using 5:randInt(51,1)). Here, the states 
(and Washington DC) are {Arkansas, Washington DC, Idaho, 
Maryland, Michigan, Mississippi, Virginia, Wyoming}. Corresponding 
percents are {28.7 21.8 24.5 26 28.9 32.8 25 24.6}. 
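For readers without a TI-84, the same kind of selection can be sketched in Python. This is an alternative to the calculator steps, not part of the original solution; `random.sample` draws without replacement, so no repeated numbers need to be regenerated.

```python
import random

NUMBER_OF_ENTRIES = 51   # 50 states plus Washington, DC, numbered 1 - 51

# Draw 8 distinct entry numbers; each run gives a different sample.
picked = sorted(random.sample(range(1, NUMBER_OF_ENTRIES + 1), k=8))
print(picked)
```

Each number is then matched to the state with that position in the vertically numbered table, exactly as in the calculator version.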


[Bar graph. Vertical axis: Percent (%), scaled from 0 to 40 in steps of 5. Horizontal axis: Arkansas, Washington DC, Idaho, Maryland, Michigan, Mississippi, Virginia, Wyoming.]


Exercise: 


Problem: 


A music school has budgeted to purchase 3 musical instruments. They plan to purchase a piano costing $3,000, a guitar costing $550, and a drum set costing $600. The mean cost for a piano is $4,000 with a standard deviation of $2,500. The mean cost for a guitar is $500 with a standard deviation of $200. The mean cost for drums is $700 with a standard deviation of $100. Which cost is the lowest when compared to other instruments of the same type? Which cost is the highest when compared to other instruments of the same type? Justify your answer numerically.


Solution: 


For pianos, the cost of the piano is 0.4 standard deviations BELOW 
the mean: (3000 - 4000)/2500 = -0.4. For guitars, the cost of the guitar 
is 0.25 standard deviations ABOVE the mean: (550 - 500)/200 = 0.25. 
For drums, the cost of the drum set is 1.0 standard deviations BELOW 
the mean: (600 - 700)/100 = -1.0. Of the three, the drum set costs the 
least in comparison to the cost of other instruments of the same type. 
The guitar costs the most in comparison to the cost of other instruments 
of the same type. 
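
The comparison above can be checked with a short Python sketch. It is an 
illustration only: the dictionary layout and the helper name z_score are 
assumptions, while the costs, means, and standard deviations come from 
the problem statement. 

```python
# (cost, mean, standard deviation) for each instrument, from the problem.
instruments = {
    "piano": (3000, 4000, 2500),
    "guitar": (550, 500, 200),
    "drum set": (600, 700, 100),
}


def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


z = {name: z_score(*vals) for name, vals in instruments.items()}
for name, score in z.items():
    print(f"{name}: z = {score:+.2f}")

# The smallest z-score is the lowest relative cost; the largest is the highest.
lowest = min(z, key=z.get)
highest = max(z, key=z.get)
print("lowest relative cost:", lowest)
print("highest relative cost:", highest)
```

Standardizing each cost puts the three instruments on a common scale, 
which is why the $600 drum set can be called cheaper, relatively, than 
the $550 guitar. 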


Exercise: 
Problem: 
Suppose that a publisher conducted a survey asking adult consumers 
the number of fiction paperback books they had purchased in the 
previous month. The results are summarized in the table below. (Note 
that this is the data presented for publisher B in homework exercise 
13.) 


# of books    Freq.    Rel. Freq. 
0             18 
1             24 
2             24 
3             22 
4             15 
5             10 
7             5 
9             1 

Publisher B 

a. Are there any outliers in the data? Use an appropriate numerical 
test involving the IQR to identify outliers, if any, and clearly state 
your conclusion. 

b. If a data value is identified as an outlier, what should be done 
about it? 

c. Are any data values further than 2 standard deviations away from 
the mean? In some situations, statisticians may use this criterion to 
identify data values that are unusual compared to the other data 
values. (Note that this criterion is most appropriate to use for data 
that are mound-shaped and symmetric, rather than for skewed 
data.) 

d. Do parts (a) and (c) of this problem give the same answer? 

e. Examine the shape of the data. Which part, (a) or (c), of this 
question gives a more appropriate result for this data? 

f. Based on the shape of the data, which is the most appropriate 
measure of center for this data: mean, median, or mode? 


Solution: 


e IQR = 4 - 1 = 3; Q1 - 1.5*IQR = 1 - 1.5(3) = -3.5; Q3 + 
1.5*IQR = 4 + 1.5(3) = 8.5. The data value of 9 is larger than 8.5. 
The purchase of 9 books in one month is an outlier. 

e The outlier should be investigated to see if there is an error or 
some other problem in the data; then a decision whether to 
include or exclude it should be made based on the particular 
situation. If it was a correct value then the data value should 
remain in the data set. If there is a problem with this data value, 
then it should be corrected or removed from the data. For 
example: If the data was recorded incorrectly (perhaps a 9 was 
miscoded and the correct value was 6) then the data should be 
corrected. If it was an error but the correct value is not known it 
should be removed from the data set. 

e xbar - 2s = 2.45 - 2*1.88 = -1.31; xbar + 2s = 2.45 + 2*1.88 = 
6.21. Using this method, the five data values of 7 books 
purchased and the one data value of 9 books purchased would be 
considered unusual. 

e No: part (a) identifies only the value of 9 to be an outlier but part 
(c) identifies both 7 and 9. 

e The data is skewed (to the right). It would be more appropriate to 
use the method involving the IQR in part (a), identifying only the 
one value of 9 books purchased as an outlier. Note that part (c) 
remarks that identifying unusual data values by using the criterion 
of being further than 2 standard deviations away from the mean is 
most appropriate when the data are mound-shaped and 
symmetric. 

e The data are skewed to the right. For skewed data it is more 
appropriate to use the median as a measure of center. 
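
The calculations in parts (a) and (c) can be reproduced with the 
following Python sketch. It is an illustration under stated assumptions, 
not the method the solution prescribes: the variable names are 
hypothetical, the frequencies come from the Publisher B table, and the 
quartiles use the default method of statistics.quantiles, which may differ 
from the textbook's rule on other data sets but gives Q1 = 1 and Q3 = 4 
here. 

```python
import statistics

# Frequency table for Publisher B: number of books -> frequency.
freq = {0: 18, 1: 24, 2: 24, 3: 22, 4: 15, 5: 10, 7: 5, 9: 1}

# Expand the table into the sorted list of 119 individual responses.
data = sorted(value for value, count in freq.items() for _ in range(count))

# Part (a): the 1.5*IQR rule.
q1, _median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
iqr_outliers = sorted({x for x in data if x < lower_fence or x > upper_fence})

# Part (c): values more than 2 sample standard deviations from the mean.
mean = statistics.mean(data)
s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
low, high = mean - 2 * s, mean + 2 * s
unusual = sorted({x for x in data if x < low or x > high})

print("fences:", lower_fence, upper_fence, "outliers:", iqr_outliers)
print("mean:", round(mean, 2), "s:", round(s, 2), "unusual:", unusual)
```

The unrounded cutoffs in part (c) can differ in the last digit from the 
hand calculation, which rounds the mean and standard deviation before 
combining them. The values flagged are the same either way. 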


**Exercises 32 and 33 contributed by Roberta Bloom 


